Thus saith Mike Howarth:
> I've already played around with differing depths generally from 3 to 10 and
> have had no distinguishable difference in results
>......
> Any more ideas?



I fought with a similar problem for quite a while.  I suggest changing two
things in your nutch-site.xml:

First, setting http.content.limit to -1 stops Nutch from truncating downloaded 
pages.  A truncated page loses any links past the cutoff, so they never get 
followed.  As long as your pages aren't so big that they'll kill the machine 
you're using, removing the truncation should work.

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

Second, by default, Nutch only processes the first 100 outlinks it finds on a 
page.  If you set db.max.outlinks.per.page to -1, it will process all of them.

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
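
If you want to confirm that both values are written in the file the way you 
intended, something like the following might help.  It is only a rough sketch: 
it assumes the usual layout of a <configuration> root with <property> children, 
the helper names (read_properties, is_unlimited) are made up for illustration, 
and it checks the XML text only, not what Nutch actually loads at runtime.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return a {name: value} dict built from <property> elements."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def is_unlimited(props, name):
    """True if the property is present and set to a negative integer."""
    try:
        return int(props[name]) < 0
    except (KeyError, ValueError):
        return False


# Inline sample for illustration; in practice you would read the text of
# your own nutch-site.xml instead (its location is an assumption here).
SAMPLE = """<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    props = read_properties(SAMPLE)
    for name in ("http.content.limit", "db.max.outlinks.per.page"):
        state = "unlimited" if is_unlimited(props, name) else "limited or unset"
        print(name, "->", state)
```

A property that is missing from the file is reported as "limited or unset", 
since in that case the default from nutch-default.xml would apply.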


I hope this helps!

Ann




 