Hi Michael,
What you're proposing will only make a single substitution of two slashes, because the pattern matches the whole URL at once. That is, it won't deal correctly with host.name//dir//page.html or host.name///dir/page.html.
I would advise the following.
<regex>
  <pattern>([^:])//+</pattern>
  <substitution>$1/</substitution>
</regex>

This will substitute any instance of two or more slashes with a single slash, provided those slashes are not immediately preceded by a colon.
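For anyone who wants to check the difference before touching regex-normalize.xml, here is a rough sketch in Python that applies both rules to the URLs from this thread. It assumes that replacing every match with Python's re module behaves like the normalizer does for these patterns; the function names collapse_slashes and original_rule are made up for illustration and are not part of Nutch. Python writes the back-reference as \1 where the XML rule uses $1.

```python
import re

# Suggested rule: a non-colon character followed by two or more slashes.
# The captured character is put back, followed by a single slash.
SUGGESTED = re.compile(r"([^:])//+")

# Originally proposed rule: it spans the whole URL, so it can only
# remove one pair of slashes per application.
ORIGINAL = re.compile(r"(.*://.*)//(.*)")


def collapse_slashes(url: str) -> str:
    """Apply the suggested rule (hypothetical helper, not a Nutch API)."""
    return SUGGESTED.sub(r"\1/", url)


def original_rule(url: str) -> str:
    """Apply the originally proposed rule once (hypothetical helper)."""
    return ORIGINAL.sub(r"\1/\2", url)


if __name__ == "__main__":
    samples = [
        "http://host.name/dir/page.html",
        "http://host.name//dir//page.html",
        "http://host.name///dir/page.html",
        "http://host.name////dir/page.html",
    ]
    for url in samples:
        print(url)
        print("  suggested:", collapse_slashes(url))
        print("  original: ", original_rule(url))
```

The "//" after "http:" is left alone because the character in front of it is a colon, which the [^:] class refuses to match.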
Regards,
David.
Date: Wed, 31 Aug 2005 22:25:07 +0200
From: Michael Nebel <[EMAIL PROTECTED]>
To: [email protected]
Subject: [Nutch-general] need regex-normalize.xml help (crawler trap)
Reply-To: [EMAIL PROTECTED]

Hi,

my crawler got caught by a site with an url-loop. Each time I fetch a page, the same page with one more / is added to the fetchlist. So the urls look like:

http://host.name/dir/page.html
http://host.name//dir/page.html
http://host.name///dir/page.html
http://host.name////dir/page.html
...

I think, it should be possible to fix this by using the regex-normalize.xml. How does the following rule look?

<regex>
  <pattern>(.*://.*)//(.*)</pattern>
  <substitution>$1/$2</substitution>
</regex>

is this ok?

Regards

Michael

--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/
