[ 
https://issues.apache.org/jira/browse/NUTCH-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647387#action_12647387
 ] 

kristian-b edited comment on NUTCH-661 at 11/13/08 12:48 PM:
--------------------------------------------------------------

The Uniform Resource Locators (URL) specification (RFC 1738) defines the space 
character as "unsafe" and states that all unsafe characters must always be 
encoded within a URL. The URL encoding for space is %20 so the document URL 
should be: 
http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007%20-%20FINAL.doc
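
As a rough illustration of that encoding step, the sketch below builds the URL
from its parts with the JDK's java.net.URI, whose multi-argument constructors
percent-encode characters that are illegal in a path. The class and method
names (UrlSpaceEncoder, encode) are hypothetical and are not part of Nutch; 
this is only one possible way to produce the encoded form, not how the fetcher
does or should do it.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not a Nutch class.
final class UrlSpaceEncoder {

    // Builds a URL string from an unencoded path. The multi-argument URI
    // constructor quotes illegal characters such as the space as %20.
    // Caveat: it also quotes '%', so an already-encoded path passed in here
    // would be encoded a second time.
    static String encode(String scheme, String host, String rawPath)
            throws URISyntaxException {
        return new URI(scheme, host, rawPath, null).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(encode(
                "http",
                "intranet-rtd.rtd.cec.eu.int",
                "/services/docs/AAR_2007 - FINAL.doc"));
    }
}
```

Passing the raw path separately from the host keeps the sketch from having to
parse a string that is not yet a valid URL.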
 
Your intranet includes links to invalid URLs that do not conform to the spec. 
While Nutch could handle this error, it shouldn't necessarily fetch malformed 
URLs. Some clients automatically encode illegal characters when they encounter 
them, which is why people often don't realise that there are problems with 
their links.

      was (Author: kristian-b):
    The Uniform Resource Locators (URL) specification (RFC 1738) defines the 
space character as "unsafe" and states that all unsafe characters must always 
be encoded within a URL. The URL encoding for space is %20 so the document URL 
should be: 
http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007%20-%20FINAL.doc
 
Your intranet uses invalid URLs that do not conform to the spec. While Nutch 
could handle this error more gracefully, it shouldn't necessarily allow 
malformed URLs to be fetched. It is because some clients attempt to 
automatically encode illegal characters when they are encountered that people 
don't realise there are problems with their links.
  
> errors when the uri contains space characters 
> ----------------------------------------------
>
>                 Key: NUTCH-661
>                 URL: https://issues.apache.org/jira/browse/NUTCH-661
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: RedHat 5.1
>            Reporter: Christos LAIOS
>
> While spidering our intranet, I get the following errors when the URI 
> contains space characters:
> fetch of http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007 - 
> FINAL.doc failed with: java.lang.IllegalArgumentException: Invalid uri 
> 'http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007 - FINAL.doc': 
> escaped absolute path not valid

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
