There are probably two settings in nutch-default.xml that
you'll need to tweak:

http.content.limit -- by default it's 64K. If a page is
larger than that, the fetcher essentially truncates it,
so you could be missing lots of links that appear later
in the page.

max.outlinks.per.page -- by default it's 100. You might
want to increase this: on pages with something like a
nested navigation sidebar with tons of links, the sidebar
alone can use up the limit, so it won't get any links
from the main part of the page.
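
As a rough sketch, overrides for the two settings might look
like the fragment below. The property names are the ones given
above, and the values are arbitrary illustrations, not
recommendations; check the exact names and defaults against your
own copy of nutch-default.xml. Overrides usually go in
nutch-site.xml rather than in nutch-default.xml itself, with the
<property> elements placed inside that file's existing root
element.

```xml
<!-- Illustrative values only (assumptions, not recommendations). -->
<property>
  <name>http.content.limit</name>
  <!-- Size in bytes; 65536 is the 64K default mentioned above. -->
  <value>262144</value>
</property>

<property>
  <name>max.outlinks.per.page</name>
  <!-- Raised from the default of 100 mentioned above. -->
  <value>500</value>
</property>
```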

The *.xml files are fairly descriptive, so just reading through
them can be pretty helpful. I don't know if there is a full
guide to the config files.

Howie




1)
I did several test runs fetching pages from two
websites. The fetch depth was 10.

After checking the log files, I found that the links
actually fetched are very different for the two sites.

On one site, which has lots of news, only the first two
depths fetched properly, and only 5 links were fetched.
The actual number of links on that site is far beyond that.

The other site fetched through all 10 rounds and fetched
hundreds of links.

I wonder if anyone has had a similar experience. Should I
set up the configuration files in /conf/?

2)
Also, in the Nutch/conf/ directory, I found several
configuration files. So far, I have only modified
crawl-urlfilter.txt to let it accept all URLs
(*.*).

Is that the right approach?

I haven't touched the other conf files. Is there a
guide on how to use these files?

thanks,

Michael,







_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
