[Nutch-dev] Re: Why Crawl failed to fetch so many pages?

Matthias Jaekle Thu, 14 Apr 2005 02:00:23 -0700

try +^http://news.buaa.edu.cn/*

This should not be the reason.
Your regex matches URLs starting with:
http://news.buaa.edu.cn
http://news.buaa.edu.cn/
http://news.buaa.edu.cn//
http://news.buaa.edu.cn/// ...
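
One way to check this is to run the pattern against a few URLs. The sketch below uses Python's `re` module rather than Nutch's Java regex engine (the two should agree for a pattern this simple). It assumes the leading `+` is the filter's "include" marker and not part of the regex, and that the filter searches for the pattern within the URL. The sample URLs are made up for illustration.

```python
import re

# The rule without its leading "+" (assumed to be the include marker, not regex).
pattern = re.compile(r"^http://news.buaa.edu.cn/*")

# Hypothetical sample URLs, for illustration only.
urls = [
    "http://news.buaa.edu.cn",
    "http://news.buaa.edu.cn/",
    "http://news.buaa.edu.cn//index.html",
    "http://www.buaa.edu.cn/",
]

# "/*" means "zero or more slashes", so the trailing part never rejects a URL.
matches = {url: pattern.search(url) is not None for url in urls}
for url, ok in matches.items():
    print(ok, url)
```

Only the last URL is rejected, because its host differs. The trailing `/*` makes no difference to which URLs are accepted.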

The only thing I would try is to escape some characters to make the pattern more precise. An unescaped dot matches any character. Better: +^http:\/\/news\.buaa\.edu\.cn
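
The effect of escaping the dots can be seen with a small comparison. This is only a sketch in Python's `re` module, not Nutch's own Java regex engine. The `lookalike` URL is a hypothetical host invented to show the difference, and the leading `+` is again assumed to be the filter's include marker and left out.

```python
import re

# Unescaped dots: each "." matches any single character.
loose = re.compile(r"^http://news.buaa.edu.cn")
# Escaped dots (and slashes): each "\." matches only a literal dot.
strict = re.compile(r"^http:\/\/news\.buaa\.edu\.cn")

real = "http://news.buaa.edu.cn/index.html"
# Hypothetical URL where other characters stand in for the dots.
lookalike = "http://newsXbuaa-edu-cn.example.com/"

loose_hits = [bool(loose.search(u)) for u in (real, lookalike)]
strict_hits = [bool(strict.search(u)) for u in (real, lookalike)]

print("loose :", loose_hits)
print("strict:", strict_hits)
```

Both patterns accept the real URL, but only the unescaped one also accepts the lookalike. That makes the escaped form stricter, though it would not explain pages on the real host going unfetched.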

Did you run enough rounds to reach the depth you want?
Each crawl round only fetches the links that are already known at that point.
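
A toy model may make the point about rounds clearer. The sketch below is not Nutch code: the link graph, the `crawl` function, and the page names are all hypothetical. It only illustrates that links discovered in one round are not fetched until the next.

```python
# Toy link graph (hypothetical pages): page -> links found on that page.
links = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/b": [],
    "/a1": ["/a2"],
    "/a2": [],
}

def crawl(seeds, rounds):
    """Return the set of pages fetched after the given number of rounds."""
    fetched = set()
    known = set(seeds)
    for _ in range(rounds):
        to_fetch = known - fetched  # only URLs discovered in earlier rounds
        if not to_fetch:
            break
        for page in to_fetch:
            fetched.add(page)
            # Newly found links become known, but wait for the next round.
            known.update(links.get(page, []))
    return fetched

for depth in (1, 2, 3, 4):
    print(depth, sorted(crawl(["/"], depth)))
```

In this model a page three links away from the seed is only fetched in the fourth round, so too few rounds leaves deeper pages unfetched even when the URL filter accepts them.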

Matthias

--
http://www.eventax.com - eventax GmbH
http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events


