Hi Canan, thanks. I have created an issue to fix this "count" variable problem: https://issues.apache.org/jira/browse/NUTCH-1594.
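To make the bug concrete, here is a minimal self-contained sketch (not the actual ParseUtil source; the class and method names are made up for illustration). It mirrors the loop quoted below: the guard reads "count < maxOutlinks", so if count is never incremented the limit is effectively ignored, and the likely fix is simply to bump count for every outlink that is kept.

    // Sketch only -- not the real Nutch ParseUtil code. It shows why a loop
    // guarded by "count < maxOutlinks" has no effect if count never changes,
    // and one possible fix: increment count for each outlink that is kept.
    public class OutlinkLimitSketch {

      static String[] limitOutlinks(String[] outlinks, int maxOutlinks) {
        // maxOutlinks < 0 is treated as "no limit", mirroring db.max.outlinks.per.page = -1
        int limit = maxOutlinks < 0 ? outlinks.length : maxOutlinks;
        java.util.List<String> kept = new java.util.ArrayList<String>();
        int count = 0;
        for (int i = 0; count < limit && i < outlinks.length; i++) {
          kept.add(outlinks[i]);
          count++; // without this increment, the "count < limit" guard never triggers
        }
        return kept.toArray(new String[0]);
      }

      public static void main(String[] args) {
        String[] links = { "http://a.example/", "http://b.example/", "http://c.example/" };
        // With maxOutlinks = 2, only the first two outlinks should be kept.
        System.out.println(java.util.Arrays.toString(limitOutlinks(links, 2)));
      }
    }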
Maybe you can also test with the bin/nutch command and run the crawl step by step; check the result of the 3rd iteration and see what happens (a rough sketch of the individual steps follows at the end of this message). Hope this helps you.

On Sat, Jun 29, 2013 at 10:45 PM, feng lu <[email protected]> wrote:

> Hi Canan
>
> Yes, the "count" variable is never changed; it may be a bug. But your
> problem may not be caused by this issue. In your 3rd iteration it may be
> caused by a fetch or parse failure, so no new outlinks are generated.
>
>
> On Fri, Jun 28, 2013 at 8:52 PM, Canan GİRGİN <[email protected]> wrote:
>
>> Hi Lewis,
>>
>> The "db.max.outlinks.per.page" parameter is never used in the Nutch 2.x
>> source code.
>>
>> It is controlled by the ParseUtil class at this line:
>> for (int i = 0; count < maxOutlinks && i < outlinks.length; i++)
>>
>> But the "count" variable is never changed.
>>
>> Canan
>>
>>
>> On Fri, Jun 28, 2013 at 2:32 PM, Jamshaid Ashraf <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have followed the given link and updated 'db.max.outlinks.per.page' to
>>> -1 in the 'nutch-default' file.
>>>
>>> But I am facing the same issue while crawling
>>> http://www.halliburton.com/en-US/default.page and cnn.com; below is the
>>> last line of the fetcher job, which shows 0 pages found on the 3rd or
>>> 4th iteration.
>>>
>>> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0
>>> URLs in 0 queues
>>> -activeThreads=0
>>> FetcherJob: done
>>>
>>> Please note that when I crawl amazon and other sites it works fine. Do
>>> you think it is because of some restriction on halliburton's side
>>> (robots.txt) or some misconfiguration at my end?
>>>
>>> Regards,
>>> Jamshaid
>>>
>>>
>>> On Fri, Jun 28, 2013 at 12:37 AM, Lewis John Mcgibbney <
>>> [email protected]> wrote:
>>>
>>> > Hi,
>>> > Can you please try this
>>> > http://s.apache.org/wIC
>>> > Thanks
>>> > Lewis
>>> >
>>> >
>>> > On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf <
>>> > [email protected]> wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > I'm using Nutch 2.x with HBase and tried to crawl
>>> > > http://www.halliburton.com/en-US/default.page with depth level 5.
>>> > >
>>> > > Following is the command:
>>> > >
>>> > > bin/crawl urls/seed.txt HB http://localhost:8080/solr/ 5
>>> > >
>>> > > It worked well until the 3rd iteration, but for the remaining 4th
>>> > > and 5th nothing was fetched (the same happened with cnn.com).
>>> > > However, if I try to crawl other sites like amazon with depth level
>>> > > 5, it works.
>>> > >
>>> > > Could you please guide me on what the reasons could be for the 4th
>>> > > and 5th iterations failing?
>>> > >
>>> > > Regards,
>>> > > Jamshaid
>>> > >
>>> >
>>> >
>>> > --
>>> > *Lewis*
>>> >
>>>
>>
>>
>
>
> --
> Don't Grow Old, Grow Up... :-)


--
Don't Grow Old, Grow Up... :-)
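P.S. A rough sketch of running the steps individually for Nutch 2.x (the crawl id "HB", seed directory, and Solr URL are taken from the bin/crawl command quoted above; option names can differ between 2.x versions, so please check the usage printed by "bin/nutch <command>" for your build):

    # Sketch only: run "bin/nutch <command>" with no arguments to confirm
    # the exact options supported by your Nutch 2.x version.
    bin/nutch inject urls/ -crawlId HB            # seed the webtable from the seed dir
    # repeat the next four steps once per iteration (5 times for depth 5)
    bin/nutch generate -topN 1000 -crawlId HB     # select a batch of URLs to fetch
    bin/nutch fetch -all -crawlId HB              # fetch the generated batch
    bin/nutch parse -all -crawlId HB              # parse; fetch/parse failures show up here
    bin/nutch updatedb -crawlId HB                # write newly discovered outlinks back
    # after the 3rd iteration, check what actually got fetched and parsed:
    bin/nutch readdb -dump webtable_dump -crawlId HB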

