Nutch by default will only download and parse the first 65536 bytes of an HTTP response, so any links past that point in a large page are never seen. You can raise this to your desired limit by changing the http.content.limit configuration property.
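For example, a minimal override sketch in conf/nutch-site.xml (http.content.limit and its 65536-byte default come from nutch-default.xml; the 262144 value below is just an illustrative choice):

<property>
  <name>http.content.limit</name>
  <!-- Maximum bytes downloaded per page; larger pages are truncated
       before parsing, so links past the cutoff are lost.
       262144 is an arbitrary example; -1 disables the limit entirely. -->
  <value>262144</value>
</property>

Remember to re-fetch after changing it, since segments fetched earlier were already truncated at the old limit.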
Another question is whether some of the links are duplicates? (One quick way to check this is sketched after the quoted thread below.)

Dennis Kubes

Mike Howarth wrote:
> Thanks for the response
>
> I've already played around with differing depths, generally from 3 to 10, and
> have had no distinguishable difference in results.
>
> Furthermore, I've tried running the crawl both with the topN flag and
> omitting it, with little difference.
>
> Any more ideas?
>
>
> Ratnesh,V2Solutions India wrote:
>> Hi,
>> It may be that the depth you specify is not able to reach the
>> desired page link, so try adjusting the depth and thread settings at
>> crawl time, like:
>>
>> bin/nutch crawl urldir -dir crawl-dir -depth 20 -threads 10 -topN 50
>>
>> Try increasing these values; you might get better results.
>> If I get any updates regarding this, I will let you know.
>>
>> Thanks
>>
>>
>> Mike Howarth wrote:
>>> I was wondering if anyone could help me.
>>>
>>> I'm currently trying to get Nutch to crawl a site I have. At the moment
>>> I'm pointing Nutch at the root URL, e.g. http://www.example.com
>>>
>>> I know that I have over 130 links on the index page, however Nutch is
>>> only finding 87 links. It appears that Nutch stops crawling, and the
>>> hadoop.log doesn't give any indication why this may occur.
>>>
>>> I've amended my crawl-urlfilter.txt to look like this:
>>>
>>> # The url filter file used by the crawl command.
>>>
>>> # Better for intranet crawling.
>>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>>>
>>> # Each non-comment, non-blank line contains a regular expression
>>> # prefixed by '+' or '-'. The first matching pattern in the file
>>> # determines whether a URL is included or ignored. If no pattern
>>> # matches, the URL is ignored.
>>>
>>> # skip file:, ftp:, & mailto: urls
>>> -^(file|ftp|mailto):
>>>
>>> # skip image and other suffixes we can't yet parse
>>> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|js)$
>>>
>>> # skip URLs containing certain characters as probable queries, etc.
>>> -[?*!@=]
>>>
>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break
>>> # loops
>>> #-.*(/.+?)/.*?\1/.*?\1/
>>>
>>> # accept hosts in MY.DOMAIN.NAME
>>> -^https:\/\/.*
>>> +.
>>>
>>> # skip everything else
>>> #-^https://.*
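To check the duplicates theory, one quick sanity check (a sketch assuming a Nutch 0.8/0.9-style layout; "crawl-dir" below is whatever you passed to -dir):

bin/nutch readdb crawl-dir/crawldb -stats

The "TOTAL urls" figure it prints is the number of distinct URLs Nutch knows about. If your 130+ on-page links collapse to roughly 87 distinct URLs there, duplicates explain the gap; if not, look at the url filters or at the http.content.limit truncation mentioned above.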
