Hi, I found the reason. The maximum number of outlinks that Nutch will process for a page is 100, and the page on this website contains more than 300 URLs, so the extra links were dropped. Now everything is OK.
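For reference, this limit comes from the Nutch configuration. In the copy of nutch-default.xml I have it is exposed as db.max.outlinks.per.page (please check whether that property exists in your release; older versions may hardcode the limit). A sketch of an override in conf/nutch-site.xml would look like:

  <!-- Sketch only: raise the per-page outlink limit. The property name
       db.max.outlinks.per.page is assumed here; verify it against the
       nutch-default.xml shipped with your version before relying on it. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>500</value>
    <description>Maximum number of outlinks kept per page; the default
    of 100 silently drops any links beyond the first 100.</description>
  </property>

With a value larger than the number of links on X.html, all 300+ URLs should make it into the fetch list.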
/Jack

On 9/7/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi Andrzej
>
> First of all, thanks for your quick response.
>
> On 9/7/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > Jack Tang wrote:
> > > Hi All
> > >
> > > Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
> > > while I try to do breadth-first crawling; I set the depth to 3.
> > > Any comments?
> >
> > Yes, and yes - there is a possibility that some URLs are lost, if they
> > require maintaining a single session. If you encounter such sites, a
> > depth-first crawler would be better.
>
> The website does not require maintaining a single session.
> My experiment is designed like this:
>
> X.html contains a list of URLs, say
> http://www.a.com/x1.html
> http://www.a.com/x2.html
> http://www.a.com/x3.html
> http://www.a.com/x4.html
> http://www.a.com/x5.html
> http://www.a.com/x6.html
> http://www.a.com/x7.html
> ....
> http://www.a.com/x30.html
>
> I set the depth of the crawler to 3 and X.html as its URL feed,
> and I use urlfilter-prefix as the URL filter (prefix=http://www.a.com).
> In my parser, I count the URLs: it is 10.
>
> However, if I put all 30 URLs into the URL feed file, the count in the parser is right.
> Odd?
>
> Regards
> /Jack
>
> > It's not too difficult to build one, using the tools already present in
> > Nutch. Contributions are welcome... ;-)
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
