[Nutch-general] Re: need regex-normalize.xml help (crawler trap) (Michael Nebel)

David Wallace Wed, 31 Aug 2005 21:18:30 -0700

Hi Michael,

What you're proposing will only make a single substitution of two slashes. That is, it won't deal correctly with host.name//dir//page.html or host.name///dir/page.html.
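To see why a single substitution is not enough, here is a small sketch that runs the proposed pattern over the two problem URLs. It assumes java.util.regex semantics and one pass of the rule per URL; the class and method names are made up for illustration, and the normalizer's own regex engine may behave differently.

```java
import java.util.regex.Pattern;

public class SingleSubstitutionDemo {
    // The pattern and substitution proposed in the quoted message.
    private static final Pattern PROPOSED = Pattern.compile("(.*://.*)//(.*)");

    // Hypothetical helper: applies the proposed rule once to the URL.
    public static String applyProposed(String url) {
        return PROPOSED.matcher(url).replaceAll("$1/$2");
    }

    public static void main(String[] args) {
        // The greedy groups consume the whole URL, so only the last "//" is replaced.
        System.out.println(applyProposed("http://host.name//dir//page.html"));
        // prints http://host.name//dir/page.html
        System.out.println(applyProposed("http://host.name///dir/page.html"));
        // prints http://host.name//dir/page.html
    }
}
```

In both cases a double slash survives, so the trap URL would still differ from the clean one.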

I would advise a rule that substitutes any instance of two or more slashes with a single slash, provided those slashes are not immediately preceded by a colon.
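One pattern that fits this description is `([^:])/{2,}` with the substitution `$1/`. This is only a sketch of the idea, not necessarily the exact rule intended here. It assumes java.util.regex semantics, and the class and method names are hypothetical. In regex-normalize.xml the two strings would go into the `<pattern>` and `<substitution>` elements.

```java
import java.util.regex.Pattern;

public class CollapseSlashes {
    // Assumed pattern: one non-colon character followed by two or more slashes.
    // The captured character is written back, followed by a single slash.
    private static final Pattern REPEATED_SLASHES = Pattern.compile("([^:])/{2,}");

    // Hypothetical helper: collapses every run of slashes except the one after "scheme:".
    public static String normalize(String url) {
        return REPEATED_SLASHES.matcher(url).replaceAll("$1/");
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://host.name//dir//page.html"));
        // prints http://host.name/dir/page.html
        System.out.println(normalize("http://host.name////dir/page.html"));
        // prints http://host.name/dir/page.html
        System.out.println(normalize("http://host.name/dir/page.html"));
        // prints http://host.name/dir/page.html (unchanged)
    }
}
```

Because the pattern is not anchored to the whole URL, every run of slashes is matched separately, and the `//` after `http:` is left alone since the character before it is a colon.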

Regards,

David.

Date: Wed, 31 Aug 2005 22:25:07 +0200
From: Michael Nebel <[EMAIL PROTECTED]>
To: [email protected]
Subject: [Nutch-general] need regex-normalize.xml help (crawler trap)
Reply-To: [EMAIL PROTECTED]

Hi,

My crawler got caught by a site with a URL loop. Each time I fetch a
page, the same page with one more / is added to the fetchlist. So the
URLs look like:

    http://host.name/dir/page.html
    http://host.name//dir/page.html
    http://host.name///dir/page.html
    http://host.name////dir/page.html
    ...

I think it should be possible to fix this using
regex-normalize.xml. How does the following rule look?

    <regex>
    <pattern>(.*://.*)//(.*)</pattern>
    <substitution>$1/$2</substitution>
    </regex>

Is this OK?

Regards

    Michael

--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/

