Hi Feng,

While I was debugging another problem, I saw this bug. Thanks for fixing it. I tested your patch and it worked fine.
Canan

On Sat, Jun 29, 2013 at 6:21 PM, feng lu <[email protected]> wrote:

> Hi Canan
>
> Thanks, I created an issue to fix this "count" variable problem:
> https://issues.apache.org/jira/browse/NUTCH-1594.
>
> Maybe you can test using the bin/nutch command and crawl step by step.
> Check the result of the 3rd iteration and see what happens.
>
> Hope this can help you.
>
>
> On Sat, Jun 29, 2013 at 10:45 PM, feng lu <[email protected]> wrote:
>
>> Hi Canan
>>
>> Yes, the "count" variable is never changed; it may be a bug. But your
>> problem may not be caused by this issue; in your 3rd iteration it may be
>> caused by a fetch or parse failure, so no new outlinks are generated.
>>
>>
>> On Fri, Jun 28, 2013 at 8:52 PM, Canan GİRGİN <[email protected]> wrote:
>>
>>> Hi Lewis,
>>>
>>> The "db.max.outlinks.per.page" parameter is never used in the Nutch 2.x
>>> source code.
>>>
>>> It is handled by the ParseUtil class at this line:
>>>
>>> for (int i = 0; count < maxOutlinks && i < outlinks.length; i++)
>>>
>>> But the "count" variable is never changed.
>>>
>>> Canan
>>>
>>>
>>> On Fri, Jun 28, 2013 at 2:32 PM, Jamshaid Ashraf <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have followed the given link and updated 'db.max.outlinks.per.page'
>>>> to -1 in the 'nutch-default' file,
>>>>
>>>> but I am facing the same issue while crawling
>>>> http://www.halliburton.com/en-US/default.page and cnn.com. Below is the
>>>> last line of the fetcher job, which shows 0 pages found on the 3rd or
>>>> 4th iteration:
>>>>
>>>> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
>>>> in 0 queues
>>>> -activeThreads=0
>>>> FetcherJob: done
>>>>
>>>> Please note that when I crawl Amazon and other sites it works fine. Do
>>>> you think it is because of some restriction on the Halliburton site
>>>> (robots.txt) or some misconfiguration at my end?
>>>>
>>>> Regards,
>>>> Jamshaid
>>>>
>>>>
>>>> On Fri, Jun 28, 2013 at 12:37 AM, Lewis John Mcgibbney <[email protected]> wrote:
>>>>
>>>> > Hi,
>>>> > Can you please try this:
>>>> > http://s.apache.org/wIC
>>>> > Thanks
>>>> > Lewis
>>>> >
>>>> >
>>>> > On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf <[email protected]> wrote:
>>>> >
>>>> > > Hi,
>>>> > >
>>>> > > I'm using Nutch 2.x with HBase and tried to crawl
>>>> > > http://www.halliburton.com/en-US/default.page at depth level 5.
>>>> > >
>>>> > > The following is the command:
>>>> > >
>>>> > > bin/crawl urls/seed.txt HB http://localhost:8080/solr/ 5
>>>> > >
>>>> > > It worked well up to the 3rd iteration, but for the remaining 4th and
>>>> > > 5th iterations nothing was fetched (the same happened with cnn.com).
>>>> > > But if I try to crawl other sites like Amazon at depth level 5, it
>>>> > > works.
>>>> > >
>>>> > > Could you please advise what the reasons could be for the 4th and 5th
>>>> > > iterations failing?
>>>> > >
>>>> > > Regards,
>>>> > > Jamshaid
>>>> > >
>>>> >
>>>> >
>>>> > --
>>>> > *Lewis*
>>>>
>>>
>>
>>
>> --
>> Don't Grow Old, Grow Up... :-)
>
>
> --
> Don't Grow Old, Grow Up... :-)
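For anyone following the thread later, below is a small self-contained sketch of the loop shape quoted above and one possible way a fix could look: incrementing "count" whenever an outlink is kept, and treating a negative limit as "no limit". This is only an illustration under those assumptions, not the actual NUTCH-1594 patch; the real code lives in ParseUtil and works on Nutch's Outlink type, while the class and method names here are made up for the example.

import java.util.ArrayList;
import java.util.List;

public class OutlinkLimitSketch {

  // Buggy shape of the loop: "count" starts at 0 and is never incremented,
  // so "count < maxOutlinks" stays true for any positive limit and the
  // limit is never actually applied.
  static List<String> limitBuggy(String[] outlinks, int maxOutlinks) {
    List<String> kept = new ArrayList<>();
    int count = 0;
    for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
      kept.add(outlinks[i]);
    }
    return kept;
  }

  // One possible fix: increment "count" for every outlink kept, and treat
  // a negative maxOutlinks (e.g. -1 in the config) as "keep everything".
  static List<String> limitFixed(String[] outlinks, int maxOutlinks) {
    int max = maxOutlinks < 0 ? outlinks.length : maxOutlinks;
    List<String> kept = new ArrayList<>();
    int count = 0;
    for (int i = 0; count < max && i < outlinks.length; i++) {
      kept.add(outlinks[i]);
      count++;
    }
    return kept;
  }

  public static void main(String[] args) {
    String[] links = {"a", "b", "c", "d", "e"};
    System.out.println(limitBuggy(links, 2).size()); // prints 5: the limit of 2 is ignored
    System.out.println(limitFixed(links, 2).size()); // prints 2: the limit is applied
  }
}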

