I just want to fetch all the pages on http://news.buaa.edu.cn, so I modified my crawl-urlfilter.txt like this:

#------------
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|https):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+^http://news.buaa.edu.cn/*
#-------------
But I found that many pages failed to be fetched. Those pages are reached through relative URLs such as <a href="dispnews.php?type=1&nid=2442&s_table=news_txt"> on the page http://news.buaa.edu.cn/sortnews.php?type=1 .
Can't the crawler deal with relative URLs appropriately? What can I do to fetch all the pages of a website completely?
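
For reference, here is how I understand the filter rules are applied: top to bottom, with the first matching rule deciding whether a URL is accepted (+) or rejected (-), and unmatched URLs rejected. This is just a plain-Java sketch of that reading, not Nutch's actual URLFilter code, and the class name and rule table are only illustrative:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class FilterRulesSketch {

    // Ordered rules mirroring the crawl-urlfilter.txt above:
    // pattern -> true means accept (+), false means reject (-).
    private static final Map<Pattern, Boolean> RULES = new LinkedHashMap<Pattern, Boolean>();
    static {
        RULES.put(Pattern.compile("^(file|ftp|mailto|https):"), Boolean.FALSE);
        RULES.put(Pattern.compile("\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$"), Boolean.FALSE);
        RULES.put(Pattern.compile("[?*!@=]"), Boolean.FALSE);
        RULES.put(Pattern.compile("^http://news\\.buaa\\.edu\\.cn/*"), Boolean.TRUE);
    }

    // First rule whose pattern is found anywhere in the URL decides;
    // a URL matching no rule at all is rejected.
    static boolean accepted(String url) {
        for (Map.Entry<Pattern, Boolean> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(url).find()) {
                return rule.getValue();
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The relative link resolved against the page that contains it:
        String link = "http://news.buaa.edu.cn/dispnews.php?type=1&nid=2442&s_table=news_txt";
        System.out.println(link + " -> " + (accepted(link) ? "fetched" : "skipped"));
        System.out.println("http://news.buaa.edu.cn/index.htm -> "
                + (accepted("http://news.buaa.edu.cn/index.htm") ? "fetched" : "skipped"));
    }
}

If that reading is right, the sketch prints "skipped" for the dispnews.php link because it contains '?' and '=' characters, so maybe the query-character rule is involved rather than the relative URLs themselves, but I am not sure of the evaluation order.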
Best regards,
Cao Yuzhong
2005-04-14
