[ https://issues.apache.org/jira/browse/NUTCH-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647387#action_12647387 ]
kristian-b edited comment on NUTCH-661 at 11/13/08 12:39 PM:
--------------------------------------------------------------

The Uniform Resource Locators (URL) specification (RFC 1738) defines the space character as "unsafe" and states that all unsafe characters must always be encoded within a URL. The URL encoding for space is %20 so the document URL should be:

http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007%20-%20FINAL.doc

Your intranet uses invalid URLs that do not conform to the spec. While Nutch could handle this error more gracefully, it shouldn't necessarily allow malformed URLs to be fetched. It is because some clients attempt to automatically encode illegal characters when they are encountered that people don't realise there are problems with their links.

      was (Author: kristian-b):
    The Uniform Resource Locators (URL) specification (RFC 1738) defines the space character as "unsafe" and states that all unsafe characters must always be encoded within a URL. The URL encoding for space is %20 so the document URL should be:

http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007%20-%20FINAL.doc

Your intranet uses invalid URLs that do not conform to the spec. While Nutch should handle this error more gracefully, it shouldn't necessarily allow malformed URLs to be fetched. It is because some clients attempt to automatically encode illegal characters when they are encountered that people don't realise there are problems with their links.
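
For illustration only, here is a minimal Java sketch of the kind of automatic encoding some clients perform, turning the space-containing link from the report into the %20 form given in the comment. The class and method names (UrlSpaceEncoder, encodeIllegalChars) are hypothetical and are not part of Nutch; the sketch relies only on java.net.URL and the multi-argument java.net.URI constructor, which percent-encodes characters that are illegal in each component.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper, not Nutch code: percent-encodes illegal characters
// (such as spaces) in the path, query and fragment of a URL string.
final class UrlSpaceEncoder {

    static String encodeIllegalChars(String rawUrl) {
        try {
            // java.net.URL splits the string into components without
            // rejecting spaces. (This constructor is deprecated in recent
            // JDKs but is still available.)
            URL url = new URL(rawUrl);
            // The multi-argument URI constructor quotes illegal characters
            // in each component; a space becomes %20.
            URI uri = new URI(url.getProtocol(), url.getAuthority(),
                    url.getPath(), url.getQuery(), url.getRef());
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode: " + rawUrl, e);
        }
    }

    public static void main(String[] args) {
        String raw = "http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007 - FINAL.doc";
        // prints http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007%20-%20FINAL.doc
        System.out.println(encodeIllegalChars(raw));
    }
}
```

Note that this constructor always quotes the percent character, so passing in a URL that is already encoded would encode it a second time (%20 would become %2520). A real fix would need to decide whether its input is raw or already escaped.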
> errors when the uri contains space characters
> ----------------------------------------------
>
>                 Key: NUTCH-661
>                 URL: https://issues.apache.org/jira/browse/NUTCH-661
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: RedHat 5.1
>            Reporter: Christos LAIOS
>
> While spidering our intranet, i get the following errors when the uri
> contains space characters
> fetch of http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007 -
> FINAL.doc failed with: java.lang.IllegalArgumentException: Invalid uri
> 'http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007 - FINAL.doc':
> escaped absolute path not valid

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.