There's another way of dealing with this.  You don't need to add or change any of the Java code.  Add these lines to regex-normalize.xml:
<regex>
  <pattern> </pattern>
  <substitution>%20</substitution>
</regex>
(note that there's a space inside the "pattern" tag).  Also add these lines to nutch-site.xml, if they're not there already:
    <property>
        <name>urlnormalizer.class</name>
        <value>org.apache.nutch.net.RegexUrlNormalizer</value>
        <description>Name of the class used to normalize URLs.</description>
    </property>
    <property>
        <name>urlnormalizer.regex.file</name>
        <value>regex-normalize.xml</value>
        <description>Name of the config file used by the RegexUrlNormalizer class.</description>
    </property>
This fixes up any URL that contains spaces BEFORE it is added to the URL database.  Each space gets changed to %20, which is how you want to store it.  Then when the fetch list tool pulls the URLs out, they're already in the right form.  Note that this won't help with any URLs that are already in your database.
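
If it helps to see the effect of that regex rule in isolation, here is a minimal standalone sketch using plain java.util.regex.  It is not the Nutch normalizer code; the class name SpaceNormalizerSketch and the method normalize are made up for illustration, and it only mimics the single pattern/substitution pair from regex-normalize.xml above.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only: mimics the one <regex> rule above,
// not the actual RegexUrlNormalizer implementation.
public class SpaceNormalizerSketch {
    // Same pattern as the <pattern> element: a single literal space.
    private static final Pattern SPACE = Pattern.compile(" ");

    // Replace every space with %20, as the <substitution> element specifies.
    static String normalize(String url) {
        return SPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        // URL taken from the original question further down this thread.
        String raw = "http://www.mysite.com/About Site/Case Studies/page1419.html";
        System.out.println(normalize(raw));
        // prints http://www.mysite.com/About%20Site/Case%20Studies/page1419.html
    }
}
```

A URL that already contains %20 passes through unchanged, since the pattern only matches literal spaces.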
 
Regards,
David.
 
 
 
> Date: Tue, 25 Oct 2005 15:51:09 +1000
> From: Ben <[EMAIL PROTECTED]>
> To: [email protected]
> Subject: [Nutch-general] Re: Problem crawling a site when url contains spaces

> Hi

> I had this problem and I applied the escapeWhitespace method from
> Heritrix 1.4.0 to the HttpResponse class. Here is the full code for
> the method:

[ Snipped out ]

> HTH
> -Ben

> On 10/25/05, Mohini Padhye <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I get several of the following errors while doing an intranet crawl....
>>
>> fetch of http://www.mysite.com/About Site/Case Studies/page1419.html
>> failed with:
>> net.nutch.protocol.http.HttpError: HTTP Error: 400
>>
>> The reason for this is that the url contains spaces (which is
>> represented as %20 in the url). What is the solution for crawling a
>> website with url containing spaces?
>>
>> Can I add some regex for this problem in regex-urlfilter.txt file?
>>
>> Thanks,
>>
>> Mohini
