Re: Nutch crawler is breadth-first ?

Jack Tang Wed, 07 Sep 2005 00:27:50 -0700

Hi Andrzej 

First of all, thanks for your quick response.

On 9/7/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Jack Tang wrote:
> > Hi All
> >
> > Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
> > when I try to do breadth-first crawling. I set the depth to 3.
> > Any comments?
> 
> Yes, and yes - there is a possibility that some urls are lost, if they
> require maintaining a single session. If you encounter such sites, a
> depth-first crawler would be better.

The website does not require maintaining a single session.
My experiment is set up like this:

X.html contains a list of URLs, say
http://www.a.com/x1.html
http://www.a.com/x2.html
http://www.a.com/x3.html
http://www.a.com/x4.html
http://www.a.com/x5.html
http://www.a.com/x6.html
http://www.a.com/x7.html
....
http://www.a.com/x30.html

I set the crawler depth to 3 and use X.html as the only entry in the url feed.
I use urlfilter-prefix as the URL filter (prefix=http://www.a.com).
In my parser I count the URLs that arrive, and the count is only 10.

However, if I put all 30 URLs into the url feed file, the count in the parser is right.
Odd?
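
To make the expectation concrete, here is a minimal sketch of a depth-limited breadth-first crawl over an in-memory link graph. It is plain Python, not Nutch code, and it rests on several assumptions: the seed address http://www.a.com/X.html, the off-site link to www.b.com, and the names accepts() and crawl_bfs() are all made up for illustration, and "depth" is taken to mean the number of fetch rounds with the seed as round 1. Under those assumptions all 30 linked pages should be reached, not 10.

```python
from collections import deque

PREFIX = "http://www.a.com"  # same value as the urlfilter-prefix setting above


def accepts(url, prefix=PREFIX):
    """Toy stand-in for a prefix URL filter: keep URLs that start with prefix."""
    return url.startswith(prefix)


def crawl_bfs(seeds, links, depth):
    """Fetch pages level by level and return the URLs in fetch order.

    links maps a URL to the list of URLs found on that page.
    depth is the number of rounds; seeds are fetched in round 1 (assumption).
    """
    seen = set()
    queue = deque()
    for url in seeds:
        if accepts(url) and url not in seen:
            seen.add(url)
            queue.append((url, 1))

    fetched = []
    while queue:
        url, level = queue.popleft()
        fetched.append(url)
        if level == depth:
            continue  # deepest allowed round: do not follow outlinks
        for out in links.get(url, []):
            if accepts(out) and out not in seen:
                seen.add(out)
                queue.append((out, level + 1))
    return fetched


# Hypothetical link graph: X.html links to x1.html .. x30.html plus one
# off-site page that the prefix filter should reject.
seed = "http://www.a.com/X.html"
pages = [f"http://www.a.com/x{i}.html" for i in range(1, 31)]
links = {seed: pages + ["http://www.b.com/other.html"]}

fetched = crawl_bfs([seed], links, depth=3)
print(len(fetched) - 1)  # pages reached besides the seed
```

If the real crawl stops at 10, the difference has to come from something this toy model leaves out, for example how the outlinks of X.html are extracted or how many URLs are selected per round.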

Regards
/Jack
> It's not too difficult to build one, using the tools already present in
> Nutch. Contributions are welcome... ;-)
> 
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 

-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars
