My understanding is that only up to the maximum number of outlinks are
processed for a page when updating the web db. Assuming the same page
doesn't get fetched and processed again in later fetch/update cycles, the
outlinks beyond that maximum will never be picked up, no matter how many
cycles you run.
To make sure all of the outlinks on a page are processed,
db.max.outlinks.per.page must be set to a number larger than the number of
outlinks on that page. If that is the case, the maximum would have to be
determined at run time, since the number of outlinks varies from page to
page.
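Here is a minimal sketch of that per-page reading, assuming a simplified
outlink list and parse step (this is not Nutch's actual code; only the
property name and its default of 100 come from the nutch-default.xml snippet
quoted below):

import java.util.List;
import org.apache.hadoop.conf.Configuration;

// Illustrative only: a cap applied independently to each page's outlinks.
public class PerPageCapSketch {
    static <T> List<T> capOutlinks(List<T> outlinks, Configuration conf) {
        // 100 is the default value quoted from nutch-default.xml below
        int max = conf.getInt("db.max.outlinks.per.page", 100);
        if (max >= 0 && outlinks.size() > max) {
            // outlinks beyond the cap are dropped for this page and, if the
            // page is never re-processed, never make it into the web db
            return outlinks.subList(0, max);
        }
        return outlinks;
    }
}

Under this reading, www.a.com (90), www.b.com (80) and www.c.com (50) from
the example below would each stay under the default cap of 100, so all of
their outlinks would be kept.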
Is my understanding correct?
AJ
Jack Tang wrote:
Hi All
Here is the "db.max.outlinks.per.page" property and its description in
nutch-default.xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  </description>
</property>
I don't think the description is right.
Say, my crawler feeds are:
http://www.a.com/index.php (90 outlinks)
http://www.b.com/index.jsp (80 outlinks)
http://www.c.com/index.html (50 outlinks)
and the number of crawler threads is 30. Do you think the remaining URLs
((80 - 10) outlinks + 50 outlinks) will be fetched?
I think the description should be "The maximum number of outlinks in
one fetching phase."
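Read that way, the value would act as a budget shared by all pages in one
fetching phase rather than a per-page limit. A rough sketch of that
interpretation (the names here are made up for illustration; this is not
Nutch code):

import java.util.ArrayList;
import java.util.List;

// Illustrative only: the cap treated as a budget for one whole fetch phase.
// With a budget of 100, the 90 outlinks from www.a.com leave room for only
// 10 from www.b.com, and the remaining 70 + 50 are dropped -- which is the
// behaviour the question above is asking about.
public class PhaseBudgetSketch {
    static <T> List<T> capPerPhase(List<List<T>> outlinksPerPage, int budget) {
        List<T> kept = new ArrayList<>();
        for (List<T> pageOutlinks : outlinksPerPage) {
            for (T link : pageOutlinks) {
                if (kept.size() >= budget) {
                    return kept; // budget exhausted for this phase
                }
                kept.add(link);
            }
        }
        return kept;
    }
}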
Regards
/Jack
--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]