[Nutch-dev] Re: Why Crawl failed to fetch so many pages?

Jack Tang Wed, 13 Apr 2005 06:02:10 -0700

Try:

+^http://news.buaa.edu.cn/*


On 4/13/05, cao yuzhong <[EMAIL PROTECTED]> wrote:
> 
> I want to crawl all the pages in http://news.buaa.edu.cn
> 
> The following is my crawl-urlfilter.txt:
> 
> # skip file:, ftp:, mailto:, & https: urls
> -^(file|ftp|mailto|https):
> 
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
> 
> # skip URLs containing certain characters as probable queries, etc.
> # [EMAIL PROTECTED]
> 
> # accept anything else
> #+.
> +^http://news.buaa.edu.cn


> I started the crawl at the URL http://news.buaa.edu.cn with a crawl depth of
> 50, but I found that many pages under http://news.buaa.edu.cn have not been
> fetched.
> 
> For example, the URL http://news.buaa.edu.cn/sortnews.php?type=1 has been
> fetched. It links to many pages, and some of those have also been fetched,
> such as:
> http://news.buaa.edu.cn/dispnews.php?type=1&nid=2508&s_table=news_txt
> 
> But many were not fetched, such as:
> http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt
> http://news.buaa.edu.cn/dispnews.php?type=1&nid=2504&s_table=news_txt
> http://news.buaa.edu.cn/dispnews.php?type=1&nid=2505&s_table=news_txt
> 
> What is wrong with the crawl?
> 
> Best regards.
> Cao Yuzhong
> 2005-04-13
> 
>
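
To check whether the filter rules quoted above would reject the missing URLs, the rule list can be replayed offline. The following is only a rough sketch in Python, not Nutch code. It assumes the regex URL filter tries rules from top to bottom, lets the first rule whose pattern is found anywhere in the URL decide, and rejects a URL that matches no rule. The names RULES, SUGGESTED and accepts are made up for this illustration, and the obfuscated line in the quoted file is treated as a comment because it starts with "#".

```python
import re

# Active rules transcribed from the quoted crawl-urlfilter.txt.
# "-" means reject, "+" means accept. Commented-out lines are omitted.
RULES = [
    ("-", r"^(file|ftp|mailto|https):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe)$"),
    ("+", r"^http://news.buaa.edu.cn"),
]

# Same rules, with the last one swapped for the pattern suggested in the reply.
SUGGESTED = RULES[:2] + [("+", r"^http://news.buaa.edu.cn/*")]


def accepts(url, rules=RULES):
    """Return True if the URL passes the filter.

    Assumed semantics: the first rule whose pattern is found in the URL
    decides the outcome; a URL that matches no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    urls = [
        "http://news.buaa.edu.cn/sortnews.php?type=1",
        "http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt",
        "http://news.buaa.edu.cn/dispnews.php?type=1&nid=2508&s_table=news_txt",
    ]
    for url in urls:
        print(url, accepts(url, RULES), accepts(url, SUGGESTED))
```

Under these assumptions both rule sets accept the fetched and the unfetched dispnews.php URLs alike, because a trailing "/*" only adds "zero or more slashes" to a prefix match. If the real filter behaves the same way, the cause of the missing pages may lie outside the rules shown here.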


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
