Re: Crawl not crawling entire page

Dennis Kubes Thu, 22 Mar 2007 05:51:52 -0800

Nutch by default will only parse the first 65536 bytes of the content returned by an HTTP request. You can change this to your desired limit by changing the http.content.limit configuration variable.
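
As a rough sketch, the override might look like the following, assuming it is placed in conf/nutch-site.xml (the usual place for local overrides of nutch-default.xml). The value 262144 is only an illustrative figure, not a recommendation; according to the property description in nutch-default.xml, a negative value should disable truncation altogether.

```xml
<!-- Sketch only: goes inside the <configuration> element of conf/nutch-site.xml -->
<property>
  <name>http.content.limit</name>
  <!-- illustrative value: 256 KB instead of the 65536-byte default -->
  <value>262144</value>
</property>
```

The pages affected would need to be fetched again after the change for the extra links to show up.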


Another question is whether some of the links are duplicates?
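
One way to check this outside of Nutch is to count the distinct link targets on the index page. The following is a minimal Python sketch, not anything Nutch itself runs: the names LinkCollector and count_links are hypothetical, it only looks at <a href> attributes, and treating URLs that differ only by a #fragment as duplicates is an assumption of the sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop any
                # #fragment (assumption: fragment-only variants are duplicates).
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def count_links(html, base_url):
    """Return (total links, distinct links) found in the given HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return len(parser.links), len(set(parser.links))
```

If the second number comes out well below the first for the saved index page, duplicates would explain at least part of the gap.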


Dennis Kubes

Mike Howarth wrote:

Thanks for the response

I've already played around with differing depths, generally from 3 to 10, and
have seen no distinguishable difference in results.

Furthermore, I've tried running the crawl both with the -topN flag and without
it, with little difference.

Any more ideas?


Ratnesh,V2Solutions India wrote:

Hi, it may be that the depth you specified is not enough to reach the
desired page link, so try adjusting the depth and threads settings at
the time of the crawl.

For example: bin/nutch crawl urldir -dir crawl-dir -depth 20 -threads 10 -topN 50

Try increasing these values; you might get a better result.
If I get any updates regarding this, I will let you know.

Thanks


Mike Howarth wrote:

I was wondering if anyone could help me.

I'm currently trying to get nutch to crawl a site I have. At the moment
I'm pointing nutch at the root url, e.g. http://www.example.com

I know that I have over 130 links on the index page, but nutch is
only finding 87 links. It appears that nutch stops crawling, and the
hadoop.log doesn't give any indication why this may occur.

I've amended my nutch-crawl to look like this:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|js)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
-^https:\/\/.*
+.

# skip everything else
#-^https://.*
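
To see which links the rules above would let through, the first-match behaviour described in the file's comments can be approximated with a short Python sketch. This is not Nutch's own code: accepts and RULES are hypothetical names, Python's re module stands in for Java's regex engine, unanchored search semantics are assumed, and the redacted "[EMAIL PROTECTED]" line and the commented-out rules are left out.

```python
import re

# Active rules transcribed from the filter file above, in file order.
# Each entry is (sign, pattern); '+' includes a URL, '-' ignores it.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
          r"|gz|rpm|tgz|mov|MOV|exe|png|js)$"),
    ("-", r"^https:\/\/.*"),
    ("+", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern matches `url` is a '+' rule."""
    for sign, pattern in rules:
        # Assumption: a pattern matches if it is found anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # No pattern matched: per the file's comments, the URL is ignored.
    return False
```

Running the index page's links through accepts would show whether any of the missing ones are being dropped by a suffix rule or by the https rule rather than by the crawler itself.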
