Hi Lewis,

The 'db.max.outlinks.per.page' parameter is effectively never applied in the Nutch 2.x source code.


It is supposed to be enforced by the ParseUtil class at this line:
for (int i = 0; count < maxOutlinks && i < outlinks.length; i++)

But the "count" variable is never incremented inside the loop, so the limit is never enforced.
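
For illustration, here is a minimal self-contained sketch of the fix. Only the quoted loop header comes from the source; the class, the sample data, and the variable setup are my assumptions, not the actual ParseUtil code:

    // Hypothetical standalone sketch, not the real ParseUtil source.
    public class OutlinkLimitSketch {
      public static void main(String[] args) {
        // Stand-ins for the parsed outlinks and db.max.outlinks.per.page.
        String[] outlinks = {"http://a.example/", "http://b.example/", "http://c.example/"};
        int maxOutlinks = 2;
        int count = 0;
        for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
          System.out.println(outlinks[i]);
          count++;  // the missing increment: without it, the cap never applies
        }
      }
    }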

Canan

On Fri, Jun 28, 2013 at 2:32 PM, Jamshaid Ashraf <[email protected]> wrote:

> Hi,
>
> I have followed the given link and updated 'db.max.outlinks.per.page' to
> -1 in the 'nutch-default' file.
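>
> For reference, a minimal sketch of the property entry (standard
> Hadoop-style configuration XML; a value of -1 means no limit):
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
> </property>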
>
> But I am facing the same issue while crawling
> http://www.halliburton.com/en-US/default.page and cnn.com; below is the
> last line of the fetcher job, which shows 0 pages fetched on the 3rd or
> 4th iteration.
>
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
> in 0 queues
> -activeThreads=0
> FetcherJob: done
>
> Please note that when I crawl Amazon and other sites it works fine. Do you
> think it is because of some restriction on Halliburton's side (robots.txt)
> or some misconfiguration at my end?
>
> Regards,
> Jamshaid
>
>
> On Fri, Jun 28, 2013 at 12:37 AM, Lewis John Mcgibbney <[email protected]> wrote:
>
> > Hi,
> > Can you please try this
> > http://s.apache.org/wIC
> > Thanks
> > Lewis
> >
> >
> > On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I'm using Nutch 2.x with HBase and tried to crawl the
> > > http://www.halliburton.com/en-US/default.page site to depth level 5.
> > >
> > > Following is the command:
> > >
> > > bin/crawl urls/seed.txt HB http://localhost:8080/solr/ 5
> > >
> > >
> > > It worked well up to the 3rd iteration, but in the remaining 4th and
> > > 5th iterations nothing was fetched (the same happened with cnn.com).
> > > However, if I crawl other sites like Amazon to depth level 5, it works.
> > >
> > > Could you please advise what the reasons could be for the failure of
> > > the 4th and 5th iterations?
> > >
> > >
> > > Regards,
> > > Jamshaid
> > >
> >
> >
> >
> > --
> > *Lewis*
> >
>
