Thus saith Mike Howarth:
> I've already played around with differing depths generally from 3 to 10 and
> have had no distinguisable difference in results
>......
> Anymore ideas?

I fought with a similar problem for quite a while. I suggest changing two things in your nutch-site.xml.

First, set http.content.limit to -1 so that Nutch stops truncating downloaded pages. When a large page is cut off, any links past the cutoff are never parsed, so raising the crawl depth makes no difference. As long as your pages aren't so big that fetching them whole will kill the machine you're using, removing the truncation should work.

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content, in bytes.
    If this value is nonnegative (>=0), content longer than it will be
    truncated; otherwise, no truncation at all.
    </description>
  </property>

Second, by default Nutch only processes the first 100 outlinks it encounters on a page. If you set db.max.outlinks.per.page to -1, it will process all of them.

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>The maximum number of outlinks that we'll process for a page.
    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.
    </description>
  </property>

I hope this helps!

Ann
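
P.S. If you want to double-check that both overrides really made it into the file, something like the rough Python sketch below might help. It is not part of Nutch. The helper names (read_properties, is_unlimited) and the SAMPLE text are made up for illustration, and it assumes the properties sit inside the usual <configuration> root element. It only looks at the text you give it, so it says nothing about values that come from nutch-default.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the contents of nutch-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the config text."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def is_unlimited(props, name):
    """True only if the property is present and set to a negative integer."""
    try:
        return int(props[name]) < 0
    except (KeyError, ValueError):
        return False


if __name__ == "__main__":
    props = read_properties(SAMPLE)
    for key in ("http.content.limit", "db.max.outlinks.per.page"):
        state = "unlimited" if is_unlimited(props, key) else "limited or unset"
        print(key, "->", state)
```

To run it against your real file, read the file's text and pass it to read_properties in place of SAMPLE.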
