I want to crawl all the pages on http://news.buaa.edu.cn

Here is my crawl-urlfilter.txt:

# skip file:, ftp:, mailto:, and https: urls
-^(file|ftp|mailto|https):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$


# skip URLs containing certain characters as probable queries, etc.

# accept anything else
#+.
+^http://news.buaa.edu.cn
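
(If I remember the stock file correctly, it normally has a filter rule under the
"probable queries" comment above, roughly:

  -[?*!@=]

That rule is not present in my file, so query URLs such as dispnews.php?... are
not filtered out.)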


I started the crawl at the URL http://news.buaa.edu.cn with a crawl depth of 50.
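For reference, I used the one-step crawl command, roughly like this (the
seed-file and directory names are just from my setup):

  bin/nutch crawl urls.txt -dir crawl-news -depth 50

where urls.txt contains the single seed URL above.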
But I found that many pages on http://news.buaa.edu.cn were not fetched.


For example:
The URL http://news.buaa.edu.cn/sortnews.php?type=1 was fetched.
It links to many pages, and some of those were fetched as well,
such as http://news.buaa.edu.cn/dispnews.php?type=1&nid=2508&s_table=news_txt


But many others were not fetched, such as:
http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt
http://news.buaa.edu.cn/dispnews.php?type=1&nid=2504&s_table=news_txt
http://news.buaa.edu.cn/dispnews.php?type=1&nid=2505&s_table=news_txt

What is wrong with the crawl?
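
In case it helps to diagnose, the WebDB can be inspected with the readdb tool
(the option names below are from the 0.x WebDB reader and may differ in other
versions; crawl-news is the crawl directory assumed above):

  # overall WebDB statistics (page and link counts)
  bin/nutch readdb crawl-news/db -stats

  # dump every page URL known to the WebDB, then search for a missing one
  bin/nutch readdb crawl-news/db -dumppageurl > pages.txt
  grep 'nid=2500' pages.txt

If a missing URL appears in the dump, it was discovered but never fetched; if
it does not appear at all, it was never extracted as a link.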

Best regards.
Cao Yuzhong
2005-04-13



