There are probably two settings you'll need to tweak in nutch-default.xml:

http.content.limit -- by default it's 64K; if a page is larger than that, the content is truncated, so you could be missing lots of links that appear later in the page.

max.outlinks.per.page -- by default it's 100. You might want to increase this: on a page with something like a nested navigation sidebar containing tons of links, the first 100 outlinks can all come from the sidebar, so the crawler won't get any links from the main part of the page.

The *.xml files are fairly descriptive, so just reading through them can be pretty helpful. I don't know if there is a full guide to the config files.

Howie
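(The usual practice is to put overrides in conf/nutch-site.xml rather than editing nutch-default.xml itself, so upgrades don't clobber your changes. A minimal sketch, assuming the property format used in nutch-default.xml; the values below -- 1 MB and 500 outlinks -- are just example numbers, and it's worth double-checking the exact property names against your copy of nutch-default.xml:

  <?xml version="1.0"?>
  <configuration>
    <!-- example value: allow up to 1 MB per page so links near the
         end of large pages are not lost to truncation -->
    <property>
      <name>http.content.limit</name>
      <value>1048576</value>
    </property>
    <!-- example value: keep up to 500 outlinks per page so a large
         navigation sidebar does not crowd out links from the body -->
    <property>
      <name>max.outlinks.per.page</name>
      <value>500</value>
    </property>
  </configuration>
)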
1) I ran several tests fetching pages from two websites with a fetch depth of 10. After checking the log files, I found the number of links actually fetched is very different for the two sites. On one site with lots of news, only the first two depths ran well and only 5 links were fetched -- the actual number of links on that site is far beyond that. The other site ran all 10 rounds and fetched hundreds of links. I wonder if anyone has had a similar experience. Should I change the configuration files in /conf/?

2) Also, in the Nutch/conf/ directory I found several configuration files. So far I have only modified crawl-urlfilter.txt to let it accept all URLs (*.*). Is that proper? I haven't touched the other conf files. Is there a guideline on how to use these files?

thanks,
Michael
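(For reference, an "accept everything" crawl-urlfilter.txt usually boils down to a single permissive rule at the end of the file. A minimal sketch, assuming the stock one-regex-per-line format where a '+' prefix accepts and a '-' prefix rejects; the exclusion pattern here is only an example, and the stock file ships with more of them:

  # hypothetical: reject some common binary file extensions first
  -\.(gif|jpg|png|zip|pdf)$
  # accept every remaining URL
  +.
)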
