[Nutch-general] Re: Nutch0.6 and Nutch 0.7 crawlers

eric park Wed, 12 Apr 2006 12:38:01 -0700

Hello, the problem is that they are not unwanted URLs.
I crawled the site 'www.qmind.co.kr'. I found that the Nutch 0.7 crawler
works just fine at the first depth. However, at the second depth it filters
out any links that start with 'www.qmind.co.kr'. It only crawls URLs starting
with 'qmind.co.kr'. I can't figure out why it filters out URLs starting with
'www' at the second depth. Nutch 0.6 works just fine. Are there any known bugs
in the Nutch 0.7 crawler?
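
The symptom described above would be consistent with a URL filter rule whose host pattern does not allow a 'www.' prefix. That is an assumption, not a confirmed diagnosis. Below is a minimal Python sketch of first-match-wins '+'/'-' regex rules, written to resemble the style of Nutch's regex URL filter files. The two rule strings, the helper names parse_rules and accepts, and the '/index.html' paths are hypothetical and are not taken from the actual configuration used in this crawl.

```python
import re


def parse_rules(text):
    """Parse lines of the form '+regex' (accept) or '-regex' (reject)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule that matches; reject if none match."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Hypothetical rule that only allows the bare host.
NARROW = r"+^http://qmind\.co\.kr/"
# Hypothetical rule that also allows any subdomain prefix such as 'www.'.
BROAD = r"+^http://([a-z0-9-]+\.)*qmind\.co\.kr/"

narrow_rules = parse_rules(NARROW)
broad_rules = parse_rules(BROAD)

for url in ("http://www.qmind.co.kr/index.html", "http://qmind.co.kr/index.html"):
    print(url, "narrow:", accepts(narrow_rules, url), "broad:", accepts(broad_rules, url))
```

If the filter file in use contains a rule like the narrow one, the 'www' links would be rejected while the bare-host links pass. Comparing the rejected URLs against the configured patterns in this way is one way to check whether the filter is responsible.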


Thank you,
Eric Park

2006/4/12, Andrzej Bialecki <[EMAIL PROTECTED]>:
>
> eric park wrote:
> > Hello. I tried to crawl a certain site using both Nutch 0.6 and Nutch 0.7,
> > just to compare how they differ.
> >
> > However, I get fewer URLs crawled using Nutch 0.7 than Nutch 0.6. I'll
> > paste 2 different log files below.
> >
> > As you can see below, both 0.6 and 0.7 fetch the same number of URLs at
> > the first depth, but at the second depth, Nutch 0.7 fetches only 15 URLs
> > while Nutch 0.6 fetches 34 URLs. Of course, the configuration and settings
> > are the same.
> >
>
> IIRC (it was long ago...) the version 0.6 had a bug where unwanted URLs
> would slip through the URLFilters. This was tightened in 0.7. Please
> check that the URLs that are rejected in 0.7 are really valid URLs, i.e.
> that they should be accepted.
>
> --
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
