I just want to fetch all the pages on http://news.buaa.edu.cn, so I modified my crawl-urlfilter.txt like this:

#------------
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|https):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+^http://news.buaa.edu.cn/*
#-------------
But I found that many pages failed to be fetched. Those pages are reached through relative URLs such as <a href="dispnews.php?type=1&nid=2442&s_table=news_txt"> on the page http://news.buaa.edu.cn/sortnews.php?type=1 .
Can't the crawler deal with relative URLs appropriately? What can I do to fetch all the pages of a website completely?
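
For reference, here is how I understand the filter rules are applied: top to bottom, with the first matching rule deciding whether a URL is accepted (+) or rejected (-), and unmatched URLs rejected. This is just a plain-Java sketch of that reading, not Nutch's actual URLFilter code, and the class name and rule table are only illustrative:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class FilterRulesSketch {

    // Ordered rules mirroring the crawl-urlfilter.txt above:
    // pattern -> true means accept (+), false means reject (-).
    private static final Map<Pattern, Boolean> RULES = new LinkedHashMap<Pattern, Boolean>();
    static {
        RULES.put(Pattern.compile("^(file|ftp|mailto|https):"), Boolean.FALSE);
        RULES.put(Pattern.compile("\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$"), Boolean.FALSE);
        RULES.put(Pattern.compile("[?*!@=]"), Boolean.FALSE);
        RULES.put(Pattern.compile("^http://news\\.buaa\\.edu\\.cn/*"), Boolean.TRUE);
    }

    // First rule whose pattern is found anywhere in the URL decides;
    // a URL matching no rule at all is rejected.
    static boolean accepted(String url) {
        for (Map.Entry<Pattern, Boolean> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(url).find()) {
                return rule.getValue();
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The relative link resolved against the page that contains it:
        String link = "http://news.buaa.edu.cn/dispnews.php?type=1&nid=2442&s_table=news_txt";
        System.out.println(link + " -> " + (accepted(link) ? "fetched" : "skipped"));
        System.out.println("http://news.buaa.edu.cn/index.htm -> "
                + (accepted("http://news.buaa.edu.cn/index.htm") ? "fetched" : "skipped"));
    }
}

If that reading is right, the sketch prints "skipped" for the dispnews.php link because it contains '?' and '=' characters, so maybe the query-character rule is involved rather than the relative URLs themselves, but I am not sure of the evaluation order.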
Best regards,
Cao Yuzhong
2005-04-14
