Hi Canan

Thanks. I have created an issue to fix this "count" variable problem:
https://issues.apache.org/jira/browse/NUTCH-1594.
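
For reference, below is a rough sketch of the kind of change the issue implies.
It is not the actual ParseUtil code; the URL filtering and normalizing steps are
omitted, and the class and method names are only illustrative.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.Outlink;

    /** Illustrative sketch only, not the actual ParseUtil code. */
    class OutlinkLimiter {
      static List<Outlink> limit(Configuration conf, Outlink[] outlinks) {
        // db.max.outlinks.per.page (default 100 in nutch-default.xml)
        int maxOutlinks = conf.getInt("db.max.outlinks.per.page", 100);
        // Treat a negative value such as -1 as "no limit" in this sketch.
        List<Outlink> kept = new ArrayList<Outlink>();
        int count = 0;
        for (int i = 0; (maxOutlinks < 0 || count < maxOutlinks)
            && i < outlinks.length; i++) {
          // URL filtering/normalizing of outlinks[i] would happen here.
          kept.add(outlinks[i]);
          count++; // the missing increment: without it the limit never takes effect
        }
        return kept;
      }
    }

With the increment in place, the loop stops once maxOutlinks accepted outlinks
have been collected, which is what the property is meant to do.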

Maybe you can test using the bin/nutch commands and crawl step by step.
Check the result of the 3rd iteration and see what happens.

Hope this helps.


On Sat, Jun 29, 2013 at 10:45 PM, feng lu <[email protected]> wrote:

> Hi Canan
> yes, "count" variable is never changed, may be it is a bug. but you
> problem may not caused by this issue, in your 3rd iteration it may cause by
> fetch or parse failure so it will not generate newer outlinks.
>
>
> On Fri, Jun 28, 2013 at 8:52 PM, Canan GİRGİN <[email protected]> wrote:
>
>> Hi Lewis,
>>
>> The "db.max.outlinks.per.page" parameter is never used in the Nutch 2.x
>> source code.
>>
>> It is controlled by the ParseUtil class at this line:
>> for (int i = 0; count < maxOutlinks && i < outlinks.length; i++)
>>
>> But "count" variable is never changed.
>>
>> Canan
>>
>>
>>
>>
>>
>> On Fri, Jun 28, 2013 at 2:32 PM, Jamshaid Ashraf 
>> <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have followed the given link and updated 'db.max.outlinks.per.page' to -1
>>> in the 'nutch-default' file.
>>>
>>> But I am still facing the same issue while crawling
>>> http://www.halliburton.com/en-US/default.page and cnn.com; below is the last
>>> line of the fetcher job, which shows 0 pages found on the 3rd or 4th iteration.
>>>
>>> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>>> -activeThreads=0
>>> FetcherJob: done
>>>
>>> Please note that when I crawl Amazon and other sites it works fine. Do you
>>> think it is because of some restriction on the Halliburton side (robots.txt)
>>> or some misconfiguration at my end?
>>>
>>> Regards,
>>> Jamshaid
>>>
>>>
>>> On Fri, Jun 28, 2013 at 12:37 AM, Lewis John Mcgibbney <
>>> [email protected]> wrote:
>>>
>>> > Hi,
>>> > Can you please try this
>>> > http://s.apache.org/wIC
>>> > Thanks
>>> > Lewis
>>> >
>>> >
>>> > On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf <
>>> [email protected]
>>> > > wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > I'm using Nutch 2.x with HBase and tried to crawl the
>>> > > "http://www.halliburton.com/en-US/default.page" site for depth level 5.
>>> > >
>>> > > Following is the command:
>>> > >
>>> > > bin/crawl urls/seed.txt HB http://localhost:8080/solr/ 5
>>> > >
>>> > >
>>> > > It worked well till the 3rd iteration, but for the remaining 4th and 5th
>>> > > nothing was fetched (the same happened with cnn.com). But if I try to
>>> > > crawl other sites like Amazon with depth level 5, it works.
>>> > >
>>> > > Could you please advise what the reasons could be for the failure of the
>>> > > 4th and 5th iterations?
>>> > >
>>> > >
>>> > > Regards,
>>> > > Jamshaid
>>> > >
>>> >
>>> >
>>> >
>>> > --
>>> > *Lewis*
>>> >
>>>
>>
>>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>



-- 
Don't Grow Old, Grow Up... :-)
