Hi Andrzej,

First of all, thanks for your quick response.

On 9/7/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Jack Tang wrote:
> > Hi All
> >
> > Is nutch crawler breadth-first one? It seems a lot of URLs are lost
> > while I try do breadth-first crawling, I set the depth to 3.
> > Any comments?
>
> Yes, and yes - there is a possibility that some urls are lost, if they
> require maintaining a single session. If you encounter such sites, a
> depth-first crawler would be better.

The website does not require maintaining a single session. My
experiment is set up like this: X.html contains a list of URLs, say

http://www.a.com/x1.html
http://www.a.com/x2.html
http://www.a.com/x3.html
http://www.a.com/x4.html
http://www.a.com/x5.html
http://www.a.com/x6.html
http://www.a.com/x7.html
....
http://www.a.com/x30.html

I set the crawl depth to 3 and use X.html as the URL feed. I use
urlfilter-prefix as the URL filter (prefix=http://www.a.com).
In my parser I count the URLs that reach it, and the count is only 10.
However, if I put all 30 URLs into the URL feed file directly, the
count in the parser is correct.

Odd?

Regards
/Jack

> It's not too difficult to build one, using the tools already present in
> Nutch. Contributions are welcome... ;-)
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
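
P.S. To make the expectation in the experiment above concrete, here is a
minimal Python sketch of a depth-limited breadth-first crawl with a prefix
filter. It is not Nutch code and says nothing about how Nutch behaves
internally: the in-memory link graph, the function name bfs_crawl, and the
assumption that X.html lives at http://www.a.com/X.html are all made up
for illustration.

```python
from collections import deque

def bfs_crawl(links, seeds, depth, prefix):
    """Depth-limited breadth-first crawl over an in-memory link graph.

    links  : dict mapping a URL to the URLs found on that page (hypothetical)
    seeds  : URLs listed in the url feed
    depth  : number of fetch rounds; the seeds are fetched in round 1
    prefix : only URLs starting with this prefix pass the filter
    """
    seen = set()
    frontier = deque()
    for url in seeds:
        if url.startswith(prefix) and url not in seen:
            seen.add(url)
            frontier.append(url)

    fetched = []
    for _ in range(depth):
        next_frontier = deque()
        while frontier:
            url = frontier.popleft()
            fetched.append(url)
            # Queue unseen outlinks that pass the prefix filter for the next round.
            for out in links.get(url, []):
                if out.startswith(prefix) and out not in seen:
                    seen.add(out)
                    next_frontier.append(out)
        frontier = next_frontier
    return fetched

PREFIX = "http://www.a.com"
SEED_PAGE = PREFIX + "/X.html"  # assumed location of X.html
pages = [f"{PREFIX}/x{i}.html" for i in range(1, 31)]
links = {SEED_PAGE: pages}      # X.html links to x1.html .. x30.html

via_seed_page = bfs_crawl(links, [SEED_PAGE], depth=3, prefix=PREFIX)
via_feed_file = bfs_crawl(links, pages, depth=3, prefix=PREFIX)

# Pages reached besides X.html itself, and pages reached from the feed file.
print(len(via_seed_page) - 1, len(via_feed_file))
```

In this toy model both setups reach all 30 pages by the second round, which
is why a count of 10 in the first setup looks odd to me.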
