My understanding is that only up to the maximum number of outlinks are processed for a page when updating the web db. Assuming the same page won't get fetched and processed again in later fetch/update cycles, you will never pick up the outlinks beyond that maximum, no matter how many cycles you run.

To make sure all of the outlinks on a page are processed, db.max.outlinks.per.page must be set to a number larger than the number of outlinks on that page. If that is true, then the maximum would have to be determined at run time, since the number of outlinks varies from page to page.
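
If that reading is right, the effective behavior would amount to a per-page truncation of the outlink list at webdb-update time, roughly like the sketch below (the class and helper names are made up for illustration; this is not the actual Nutch code):

    import java.util.Arrays;

    // Illustrative sketch of per-page outlink truncation.
    // OutlinkTruncation and truncate() are hypothetical, not Nutch classes.
    public class OutlinkTruncation {

        static String[] truncate(String[] outlinks, int maxOutlinksPerPage) {
            // Each page is checked against the limit independently; a negative
            // limit is treated here as "no limit" (an assumption for this sketch).
            if (maxOutlinksPerPage < 0 || outlinks.length <= maxOutlinksPerPage) {
                return outlinks;
            }
            return Arrays.copyOf(outlinks, maxOutlinksPerPage);
        }

        public static void main(String[] args) {
            String[] outlinks = new String[150];
            for (int i = 0; i < outlinks.length; i++) {
                outlinks[i] = "http://www.a.com/page" + i;
            }
            System.out.println(truncate(outlinks, 100).length);       // 100: truncated
            System.out.println(truncate(new String[40], 100).length); // 40: under the limit, untouched
        }
    }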
Is my understanding correct?

AJ


Jack Tang wrote:

Hi All

Here is the "db.max.outlinks.per.page" property and its description in
nutch-default.xml
    <property>
      <name>db.max.outlinks.per.page</name>
      <value>100</value>
      <description>The maximum number of outlinks that we'll process for a page.
      </description>
    </property>

I don't think the description is right.
Say, my crawler feeds are:
http://www.a.com/index.php (90 outlinks)
http://www.b.com/index.jsp  (80 outlinks)
http://www.c.com/index.html (50 outlinks)

and the number of crawler threads is 30. Do you think the remaining URLs
((80 - 10) outlinks + 50 outlinks) will be fetched?
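
(For concreteness, here is the difference between the two readings on those numbers; this is just arithmetic over the example feeds above, not actual Nutch behavior:)

    // Hypothetical comparison of the two readings of db.max.outlinks.per.page,
    // using the example feeds above; not Nutch code.
    public class OutlinkLimitReadings {
        public static void main(String[] args) {
            int[] feeds = {90, 80, 50};   // outlinks on a.com, b.com, c.com
            int limit = 100;              // db.max.outlinks.per.page

            // Per-page reading (as in nutch-default.xml): each page is capped
            // independently, so 90, 80 and 50 are all below 100 and every
            // outlink (220 in total) is kept.
            int perPage = 0;
            for (int n : feeds) perPage += Math.min(n, limit);

            // Per-phase reading: one global budget of 100 for the whole fetch
            // phase, so a.com uses 90, b.com adds only 10 more, and the
            // remaining (80 - 10) + 50 = 120 outlinks are never queued.
            int perPhase = Math.min(90 + 80 + 50, limit);

            System.out.println("per-page:  " + perPage);   // 220
            System.out.println("per-phase: " + perPhase);  // 100
        }
    }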

I think the description should be "The maximum number of outlinks in
one fetching phase."


Regards
/Jack

--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]