Jack,
Keep the max at 100, but run 10 cycles (i.e., depth=10) with the CrawlTool. You should see that all the outlinks get collected toward the end. 3 cycles is usually not enough.
-AJ

Jack Tang wrote:

Yes, Stefan.
But it missed some URLs; after I set the value to 3000, everything was OK.

/Jack

On 9/8/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
Jack,
That is max outlinks per html page.
All your example pages have less than 100 outlinks, right?!
Stefan
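Stefan's point above (the cap applies per page, not per fetch phase) could be sketched roughly as follows. This is just an illustration of the semantics, not actual Nutch code; `capOutlinks` and the class name are made up for the example:

```java
import java.util.Collections;
import java.util.List;

public class OutlinkCapDemo {
    // Hypothetical sketch: each page's outlink list is truncated
    // independently to db.max.outlinks.per.page. Pages with fewer
    // outlinks than the cap are unaffected.
    static List<String> capOutlinks(List<String> outlinks, int maxPerPage) {
        if (outlinks.size() <= maxPerPage) {
            return outlinks;
        }
        return outlinks.subList(0, maxPerPage);
    }

    public static void main(String[] args) {
        // A page with 150 outlinks is cut to 100 ...
        List<String> bigPage = Collections.nCopies(150, "http://example.com/link");
        System.out.println(capOutlinks(bigPage, 100).size()); // prints 100

        // ... but a page with 90 outlinks keeps all 90,
        // regardless of how many other pages are in the fetch phase.
        List<String> smallPage = Collections.nCopies(90, "http://example.com/link");
        System.out.println(capOutlinks(smallPage, 100).size()); // prints 90
    }
}
```

Under that reading, the three example pages below (90, 80, and 50 outlinks) are each under the cap, so none of their outlinks would be dropped.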

Am 07.09.2005 um 18:43 schrieb Jack Tang:

Hi All

Here is the "db.max.outlinks.per.page" property and its description in
nutch-default.xml
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>100</value>
    <description>The maximum number of outlinks that we'll
    process for a page.
    </description>
  </property>

I don't think the description is right.
Say, my crawler feeds are:
http://www.a.com/index.php (90 outlinks)
http://www.b.com/index.jsp  (80 outlinks)
http://www.c.com/index.html (50 outlinks)

and the number of crawler threads is 30. Do you think the remaining URLs
((80 - 10) outlinks + 50 outlinks) will be fetched?

I think the description should be "The maximum number of outlinks in
one fetching phase."


Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net







--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---------------------------------------------------
