[ 
https://issues.apache.org/jira/browse/NUTCH-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-927.
-------------------------------

    Resolution: Not A Problem

Not a bug. Use the mailing lists to ask questions.

> Sub pages are not getting crawled
> ---------------------------------
>
>                 Key: NUTCH-927
>                 URL: https://issues.apache.org/jira/browse/NUTCH-927
>             Project: Nutch
>          Issue Type: Bug
>          Components: injector
>    Affects Versions: 2.0
>            Reporter: Rameez Raja
>
> The goal of my program is to crawl all the pages and fetch their contents.
> Fetching the information category by category works perfectly, but the
> sub pages are not getting crawled. That is, the next pages appear as
> links at the bottom of the web page, as shown below:
> <a href="http://reviews.logitech.com/7061/224/reviews.htm?page=2" title="Next
> Page &gt;" name="BV_TrackingTag_Review_Display_NextPage">More Reviews for
> Z-5500 Digital 5.1 Speaker System</a>.
> I am using the script below to crawl the site:
> $NUTCH_HOME/search/scripts/crawl.sh testcrawlreviews 5 & > crawl.log
> where 5 is the depth.
> A snapshot of the script is shown below:
> cd $NUTCH_HOME
> bin/nutch inject $BASEDIR/crawldb urls
> bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments
> SEGMENT=`ls $BASEDIR/segments/ | tail -1`
> echo processing segment $SEGMENT
> bin/nutch fetch $BASEDIR/segments/$SEGMENT -threads 10
> bin/nutch updatedb $BASEDIR/crawldb $BASEDIR/segments/$SEGMENT -filter
> done
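
The snapshot above ends in `done`, but its loop header is not shown. The following is a minimal sketch of how those steps might be repeated once per depth level, assuming a simple counter loop. The `crawl_rounds` function, the `NUTCH` variable, and the loop itself are hypothetical and are not taken from the reporter's script; only the four `bin/nutch` commands and their arguments come from the report.

```shell
#!/bin/sh
# Sketch only: the loop header is NOT in the original report. The counter
# loop and the crawl_rounds name are assumptions made for illustration.

NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch launcher (assumed default)

crawl_rounds() {
  basedir=$1   # crawl directory, e.g. testcrawlreviews
  depth=$2     # number of generate/fetch/updatedb rounds, e.g. 5

  # Seed the crawldb once from the "urls" directory.
  "$NUTCH" inject "$basedir/crawldb" urls || return 1

  round=1
  while [ "$round" -le "$depth" ]; do
    # Create a new segment under $basedir/segments.
    "$NUTCH" generate "$basedir/crawldb" "$basedir/segments" || return 1

    # Pick the last entry in the segments directory, as the snapshot does.
    segment=$(ls "$basedir/segments/" | tail -1)
    echo "processing segment $segment"

    "$NUTCH" fetch "$basedir/segments/$segment" -threads 10 || return 1

    # Update the crawldb from the segment, with -filter as in the snapshot.
    "$NUTCH" updatedb "$basedir/crawldb" "$basedir/segments/$segment" -filter || return 1

    round=$((round + 1))
  done
}
```

Under these assumptions, `crawl_rounds testcrawlreviews 5` would run inject once and then five generate/fetch/updatedb rounds, stopping early if any step fails.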

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
