That isn't the cause.
I had already commented out the line [EMAIL PROTECTED]:
[EMAIL PROTECTED]
Thank you.

From: "David Wallace" <[EMAIL PROTECTED]>
Reply-To: [email protected]
To: <[EMAIL PROTECTED]>
Subject: Re: [Nutch-dev] Crawl-urlfilter cann't deals with relativeurls
appropriately ??
Date: Fri, 15 Apr 2005 08:13:06 +1200

Hello Cao,
The problem is not that the URLs are relative - it's the ? and =
characters.  Try changing the line
[EMAIL PROTECTED] to [EMAIL PROTECTED]
and the problem will go away.
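
To illustrate the point above, here is a minimal Python sketch of first-match URL filtering. It is not Nutch's own code. The `accepts` helper is hypothetical, and the rule `-[?=]` is only a stand-in for the redacted line, based on the statement that the ? and = characters are the problem. It also assumes that a URL matching no rule is rejected.

```python
import re


def accepts(rules, url):
    """Return True if the first rule whose pattern matches the URL starts with '+'."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: no matching rule means the URL is rejected


url = "http://news.buaa.edu.cn/dispnews.php?type=1&nid=2442&s_table=news_txt"

with_query_rule = [
    r"-^(file|ftp|mailto|https):",
    r"-[?=]",  # hypothetical stand-in for the redacted rule
    r"+^http://news.buaa.edu.cn/*",
]
# Same rules with the query-character rule dropped (i.e. commented out).
without_query_rule = [r for r in with_query_rule if r != r"-[?=]"]

print(accepts(with_query_rule, url))     # False
print(accepts(without_query_rule, url))  # True
```

With the stand-in rule active, the URL is rejected before the accept rule is ever reached; with it removed, the accept rule is the first match.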

Kind regards,
David.


From: "cao yuzhong" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [email protected]
Subject: [Nutch-dev] Crawl-urlfilter cann't deals with relative urls appropriately ??
Date: Thu, 14 Apr 2005 03:31:29 +0000

I just want to fetch all the pages in http://news.buaa.edu.cn
So I modified my crawl-urlfilter.txt like this:
#------------
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|https):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$


# skip URLs containing certain characters as probable queries, etc.
# [EMAIL PROTECTED]
# accept anything else
+^http://news.buaa.edu.cn/*
#-------------

But I found that many pages were not fetched.
Those pages are linked through relative URLs
such as <a href="dispnews.php?type=1&nid=2442&s_table=news_txt">
in page http://news.buaa.edu.cn/sortnews.php?type=1 .
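
For reference, this is what the relative link above resolves to under standard URL resolution. It is a small Python sketch using only the standard library, and it assumes the crawler resolves links against the page URL in the usual way; it does not show Nutch's own code.

```python
from urllib.parse import urljoin

# Page on which the link appears, and the relative href found in it.
base = "http://news.buaa.edu.cn/sortnews.php?type=1"
href = "dispnews.php?type=1&nid=2442&s_table=news_txt"

# Resolve the relative reference against the page URL.
absolute = urljoin(base, href)
print(absolute)
# http://news.buaa.edu.cn/dispnews.php?type=1&nid=2442&s_table=news_txt
```

The resolved URL still carries the ? and = characters of the query string, so any filter rule that keys on those characters applies to it as well.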

Can't the crawler deal with relative URLs appropriately?
What can I do to fetch all the pages of a website completely?

Best regards.
Cao Yuzhong
2005-04-14











_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
