Hi Feng,

While I was debugging another problem, I saw this bug. Thanks for fixing it. I tested your patch and it worked fine.
Canan

On Sat, Jun 29, 2013 at 6:21 PM, feng lu <[email protected]> wrote:

> Hi Canan
>
> Thanks, I created an issue to fix this "count" variable problem:
> https://issues.apache.org/jira/browse/NUTCH-1594.
>
> Maybe you can test using the bin/nutch command and crawl step by step.
> Check the result of the 3rd iteration and see what happens.
>
> Hope this can help you.
>
>
> On Sat, Jun 29, 2013 at 10:45 PM, feng lu <[email protected]> wrote:
>
>> Hi Canan
>>
>> Yes, the "count" variable is never changed; it may be a bug. But your
>> problem may not be caused by this issue; in your 3rd iteration it may be
>> caused by a fetch or parse failure, so no new outlinks are generated.
>>
>>
>> On Fri, Jun 28, 2013 at 8:52 PM, Canan GİRGİN <[email protected]> wrote:
>>
>>> Hi Lewis,
>>>
>>> The "db.max.outlinks.per.page" parameter is never used in the Nutch 2.x
>>> source code.
>>>
>>> It is handled by the ParseUtil class at this line:
>>>
>>> for (int i = 0; count < maxOutlinks && i < outlinks.length; i++)
>>>
>>> But the "count" variable is never changed.
>>>
>>> Canan
>>>
>>>
>>> On Fri, Jun 28, 2013 at 2:32 PM, Jamshaid Ashraf <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have followed the given link and updated 'db.max.outlinks.per.page'
>>>> to -1 in the 'nutch-default' file,
>>>>
>>>> but I am facing the same issue while crawling
>>>> http://www.halliburton.com/en-US/default.page and cnn.com. Below is the
>>>> last line of the fetcher job, which shows 0 pages found on the 3rd or
>>>> 4th iteration:
>>>>
>>>> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
>>>> in 0 queues
>>>> -activeThreads=0
>>>> FetcherJob: done
>>>>
>>>> Please note that when I crawl Amazon and other sites it works fine. Do
>>>> you think it is because of some restriction on the Halliburton site
>>>> (robots.txt) or some misconfiguration at my end?
>>>>
>>>> Regards,
>>>> Jamshaid
>>>>
>>>>
>>>> On Fri, Jun 28, 2013 at 12:37 AM, Lewis John Mcgibbney <[email protected]> wrote:
>>>>
>>>> > Hi,
>>>> > Can you please try this:
>>>> > http://s.apache.org/wIC
>>>> > Thanks
>>>> > Lewis
>>>> >
>>>> >
>>>> > On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf <[email protected]> wrote:
>>>> >
>>>> > > Hi,
>>>> > >
>>>> > > I'm using Nutch 2.x with HBase and tried to crawl
>>>> > > http://www.halliburton.com/en-US/default.page at depth level 5.
>>>> > >
>>>> > > The following is the command:
>>>> > >
>>>> > > bin/crawl urls/seed.txt HB http://localhost:8080/solr/ 5
>>>> > >
>>>> > > It worked well up to the 3rd iteration, but for the remaining 4th and
>>>> > > 5th iterations nothing was fetched (the same happened with cnn.com).
>>>> > > But if I try to crawl other sites like Amazon at depth level 5, it
>>>> > > works.
>>>> > >
>>>> > > Could you please advise what the reasons could be for the 4th and 5th
>>>> > > iterations failing?
>>>> > >
>>>> > > Regards,
>>>> > > Jamshaid
>>>> > >
>>>> >
>>>> >
>>>> > --
>>>> > *Lewis*
>>>>
>>>
>>
>>
>> --
>> Don't Grow Old, Grow Up... :-)
>
>
> --
> Don't Grow Old, Grow Up... :-)
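For anyone following the thread later, below is a small self-contained sketch of the loop shape quoted above and one possible way a fix could look: incrementing "count" whenever an outlink is kept, and treating a negative limit as "no limit". This is only an illustration under those assumptions, not the actual NUTCH-1594 patch; the real code lives in ParseUtil and works on Nutch's Outlink type, while the class and method names here are made up for the example.

import java.util.ArrayList;
import java.util.List;

public class OutlinkLimitSketch {

  // Buggy shape of the loop: "count" starts at 0 and is never incremented,
  // so "count < maxOutlinks" stays true for any positive limit and the
  // limit is never actually applied.
  static List<String> limitBuggy(String[] outlinks, int maxOutlinks) {
    List<String> kept = new ArrayList<>();
    int count = 0;
    for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
      kept.add(outlinks[i]);
    }
    return kept;
  }

  // One possible fix: increment "count" for every outlink kept, and treat
  // a negative maxOutlinks (e.g. -1 in the config) as "keep everything".
  static List<String> limitFixed(String[] outlinks, int maxOutlinks) {
    int max = maxOutlinks < 0 ? outlinks.length : maxOutlinks;
    List<String> kept = new ArrayList<>();
    int count = 0;
    for (int i = 0; count < max && i < outlinks.length; i++) {
      kept.add(outlinks[i]);
      count++;
    }
    return kept;
  }

  public static void main(String[] args) {
    String[] links = {"a", "b", "c", "d", "e"};
    System.out.println(limitBuggy(links, 2).size()); // prints 5: the limit of 2 is ignored
    System.out.println(limitFixed(links, 2).size()); // prints 2: the limit is applied
  }
}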

