Hello Cao,

The problem is not that the URLs are relative; it's the ? and = characters in them. Try changing the line [EMAIL PROTECTED] to [EMAIL PROTECTED] and the problem will go away.
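To illustrate why the ? and = characters matter, here is a minimal Python sketch of a first-match regex URL filter. It is not Nutch's code. It assumes that rules are tried in order, that the first pattern found anywhere in the URL decides (+ accepts, - rejects), and that a URL matching no rule is dropped. The rule -[?=] is a hypothetical stand-in for a "skip probable queries" rule, not a reconstruction of the redacted line above, and the helper names load_rules and accepts are made up for this example.

```python
import re
from urllib.parse import urljoin


def load_rules(text):
    """Parse filter lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched, so the URL is dropped


# Hypothetical rule set with a query-character skip rule enabled.
WITH_QUERY_SKIP = load_rules(r"""
-^(file|ftp|mailto|https):
-[?=]
+^http://news.buaa.edu.cn/
""")

# The same rule set with the skip rule commented out.
WITHOUT_QUERY_SKIP = load_rules(r"""
-^(file|ftp|mailto|https):
# -[?=]
+^http://news.buaa.edu.cn/
""")

# Resolve the relative link from the question against its page URL.
base = "http://news.buaa.edu.cn/sortnews.php?type=1"
href = "dispnews.php?type=1&nid=2442&s_table=news_txt"
url = urljoin(base, href)

print(url)
print("with skip rule:   ", accepts(WITH_QUERY_SKIP, url))
print("without skip rule:", accepts(WITHOUT_QUERY_SKIP, url))
```

In this sketch the relative link resolves to an absolute URL without trouble. It is rejected only because the skip rule sees ? and = before the accept rule is reached; with that rule commented out, the same URL is accepted.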
Kind regards,
David.

From: "cao yuzhong" <[EMAIL PROTECTED]>
To: [email protected]
Date: Thu, 14 Apr 2005 03:31:29 +0000
Subject: [Nutch-dev] Crawl-urlfilter cann't deals with relative urls appropriately ??
Reply-To: [EMAIL PROTECTED]

I just want to fetch all the pages in http://news.buaa.edu.cn

So I modified my crawl-urlfilter.txt like this:

#------------
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|https):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
# [EMAIL PROTECTED]

# accept anything else
+^http://news.buaa.edu.cn/*
#-------------

But I found that many pages failed to be fetched. Those pages are linked through relative URLs such as <a href="dispnews.php?type=1&nid=2442&s_table=news_txt"> in the page http://news.buaa.edu.cn/sortnews.php?type=1 .

Can't the crawler deal with relative URLs appropriately? What can I do to fetch all the pages of a website completely?

Best regards.
Cao Yuzhong
2005-04-14
