To increase the depth of a whole-web crawl, you need to run additional fetch rounds and update the database with the newly fetched URLs after each one (eventually you will also need to index these URLs along with the "homepage" URLs fetched in the first round). The following part of the tutorial shows how the 2nd and 3rd rounds of fetching (i.e. depths 2 and 3) should be run:
Now we fetch a new segment with the top-scoring 1000 pages:

  bin/nutch generate db segments -topN 1000
  s2=`ls -d segments/2* | tail -1`
  echo $s2

  bin/nutch fetch $s2
  bin/nutch updatedb db $s2

Let's fetch one more round:

  bin/nutch generate db segments -topN 1000
  s3=`ls -d segments/2* | tail -1`
  echo $s3

  bin/nutch fetch $s3
  bin/nutch updatedb db $s3

On 11/16/05, Aled Jones <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I've successfully followed the nutch whole-web crawling tutorial, except
> instead of using the urls in the DMOZ open directory I've created my own
> list of about a 100 urls.
>
> However when following the rest of the tutorial the result only seems to
> include the "home" pages at the urls specified, it doesn't seem to have
> done any crawling, just grabbed the home page url of each site.
> How do I specify that it should crawl each website to a depth of say, 10
> pages?
>
> Thanks in advance,
>
> Regards
> Aled

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
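P.S. For a deeper crawl, the generate/fetch/updatedb round from the tutorial can be wrapped in a loop instead of being typed out once per level. What follows is only a minimal sketch, assuming a Bourne-style shell, that it is run from the Nutch install directory, and that the db and segments directories already exist from the first round. The NUTCH variable, the crawl_rounds function and its two arguments are names made up for this example; they are not part of Nutch.

```shell
#!/bin/sh
# Sketch: repeat the tutorial's generate/fetch/updatedb round several times.
# NUTCH and crawl_rounds are illustrative names, not part of Nutch itself.
NUTCH=${NUTCH:-bin/nutch}

crawl_rounds() {
    rounds=$1    # number of additional rounds to run
    topn=$2      # pages to fetch per round (the tutorial uses 1000)
    i=1
    while [ "$i" -le "$rounds" ]; do
        # Generate a new segment from the top-scoring pages.
        "$NUTCH" generate db segments -topN "$topn" || return 1
        # Same idiom as the tutorial: take the last segment directory listed.
        seg=`ls -d segments/2* | tail -1`
        echo "round $i: $seg"
        "$NUTCH" fetch "$seg" || return 1
        # Feed the links found in this round back into the database.
        "$NUTCH" updatedb db "$seg" || return 1
        i=`expr "$i" + 1`
    done
}
```

With the first round already fetched, calling `crawl_rounds 9 1000` would run nine more rounds, which should correspond to the depth of 10 asked about above. The function stops at the first failing command, so a broken round is not fed into updatedb.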
