[ 
https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282231#comment-13282231
 ] 

Karl Wright edited comment on CONNECTORS-477 at 5/24/12 6:49 AM:
-----------------------------------------------------------------

About the fix itself: If you can find a document or standard describing a URL 
transformation that the Web Connector does not support, or you can show that 
such a transformation works in both IE and Firefox, then we should modify the 
WebURL class to support this transformation, and others that are similar.  I 
created the WebURL class specifically to support URL forms that the Java 
implementation of URI does not accept, since that implementation is no longer 
up to date as far as standards are concerned.  So, if there is going to be a 
fix for this issue, I'd recommend that it be done there, and not in 
WebcrawlerConnector.
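
To make the Java URI limitation concrete, here is a minimal sketch that feeds 
the path from the issue report to java.net.URI, once with a raw full-width 
space (U+3000) and once with that character percent-encoded as UTF-8.  The 
class and method names are made up for illustration and are not part of 
ManifoldCF.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriFullWidthSpaceDemo {
    /** Returns the index at which java.net.URI rejects the input, or -1 if it parses. */
    public static int rejectedAt(String url) {
        try {
            new URI(url);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        // U+3000 (IDEOGRAPHIC SPACE) is the full-width space from the issue report.
        String raw = "/site1/Shared%20Documents/test/aaa\u3000bbb.txt";
        // The same path with U+3000 written as its UTF-8 bytes, percent-encoded.
        String encoded = "/site1/Shared%20Documents/test/aaa%E3%80%80bbb.txt";

        System.out.println("raw:     " + rejectedAt(raw));
        System.out.println("encoded: " + rejectedAt(encoded));
    }
}
```

The raw form is rejected at index 34, the same position reported in the log 
quoted in the issue description, while the encoded form parses.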

My understanding of how URL encoding is supposed to work is that a URL is 
already encoded where it appears in a link, and is NOT encoded by the browser 
(or crawler).  This is necessary because the browser does not typically 
understand the context within a URL correctly.  Now, Microsoft modified that 
behavior by supporting certain transformations within IE, and other browsers 
have copied those transformations.  If there is sufficient support across 
browsers, we should go ahead and provide a similar feature in the web 
connector.
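
As a rough sketch of what such a browser-style transformation could look 
like, the helper below percent-encodes spaces, control characters and 
non-ASCII characters as UTF-8 bytes and leaves everything else, including 
existing %XX escapes, untouched.  The class name, the method name and the 
choice of UTF-8 are assumptions for illustration; this is not the WebURL 
implementation, and whether browsers actually behave this way is exactly what 
needs to be verified.

```java
import java.nio.charset.StandardCharsets;

public class LenientPathEncoder {
    /**
     * Hypothetical helper: percent-encodes ASCII space, control characters and
     * non-ASCII characters (as UTF-8). Printable ASCII, including '%', is copied
     * through unchanged, so already-encoded sequences such as %20 are preserved.
     */
    public static String encodeUnsafe(String path) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < path.length(); ) {
            int cp = path.codePointAt(i);
            i += Character.charCount(cp);
            if (cp > 0x20 && cp < 0x7F) {
                out.appendCodePoint(cp);
            } else {
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Path taken from the issue report, with U+3000 as the full-width space.
        System.out.println(encodeUnsafe("/aaa\u3000bbb/aaa.txt"));
    }
}
```

With this sketch, the full-width space in /aaa bbb/aaa.txt becomes %E3%80%80, 
which java.net.URI accepts.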

In order to check whether your transformation of full-width space qualifies as 
a feature we should support, you would want to create a website locally 
(probably running under IIS) that has documents whose names include 
problematic characters such as full-width space.  Then, also create a page 
that has links to these documents, in the form <a href="...">...</a>, where 
the full-width space character is NOT properly URL encoded but is exposed.  
Browse to the link page and click on each link.  Does the browser load the 
expected document, or not?  Which browsers work, and which do not?  If it does 
seem to be supported, are there other characters that work the same way?
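
The link page for this test could be generated with something like the 
following sketch.  The class name and the sample document names are 
hypothetical, the names are assumed not to contain HTML-special characters, 
and the output would need to be saved as UTF-8 next to the test documents.

```java
import java.util.Arrays;
import java.util.List;

public class LinkPageBuilder {
    /** Builds an HTML page whose hrefs contain the document names exactly as given (no URL encoding). */
    public static String build(List<String> documentNames) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><meta charset=\"UTF-8\"></head><body>\n");
        for (String name : documentNames) {
            // The name is deliberately NOT percent-encoded, so the browser sees the raw character.
            html.append("<a href=\"").append(name).append("\">")
                .append(name).append("</a><br>\n");
        }
        html.append("</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // U+3000 is the full-width space; U+00A0 (no-break space) is just another candidate to try.
        System.out.print(build(Arrays.asList("aaa\u3000bbb.txt", "aaa\u00A0bbb.txt")));
    }
}
```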


                


                  
> Support for full-width space against url
> ----------------------------------------
>
>                 Key: CONNECTORS-477
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-477
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>            Priority: Minor
>             Fix For: ManifoldCF 0.6
>
>         Attachments: CONNECTORS-477.patch
>
>
> When a URL includes a full-width space (" ", U+3000), MCF can't ingest the document.
> e.g.
> 1.file name
>  http://server/site1/Shared%20Documents/test/aaa bbb.txt
> 2.path
>  http://localhost/aaa bbb/aaa.txt
> MCF's log says:
> {noformat}
> WEB: Can't use url '/site1/Shared%20Documents/test/aaa bbb.txt' because it is 
> badly formed: Illegal character in path at index 34: 
> /site1/Shared%20Documents/test/aaa bbb.txt
> {noformat}


