[ https://issues.apache.org/jira/browse/CONNECTORS-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282231#comment-13282231 ]
Karl Wright commented on CONNECTORS-477:
----------------------------------------
If you can find a document or standard that describes a URL transformation
that the Web Connector does not support, or you can show that such a
transformation works in both IE and Firefox, then we should modify the WebURL
class to support it. I created the WebURL class specifically to support URL
forms that the Java implementation of URI does not handle; that implementation
is no longer up to date with the standards. So if this issue is going to be
fixed, I'd recommend doing it there, and not in WebcrawlerConnector.

But, as I said before, I'd be very careful to avoid trying to make the Web
Connector into a replacement for the SharePoint connector. My understanding of
URL encoding is that a URL is supposed to be encoded where it appears in a
link, NOT by the browser (or crawler). This is necessary because the browser
typically cannot interpret the context within the URL correctly. Microsoft
modified that standard by supporting certain transformations within IE, and
other browsers copied them. SharePoint may be relying on such non-standard
transformations to work correctly; either that, or it never presents
non-standard URLs as links at all.
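
To make the failure concrete, here is a minimal sketch of how java.net.URI
rejects a full-width space (U+3000) and what a lenient, browser-style
transformation might look like. The class FullWidthSpaceDemo and the helper
encodeSpaceChars are hypothetical names used only for illustration; they are
not part of ManifoldCF or of the WebURL class, and whether such a
transformation belongs in the crawler at all is exactly the open question
above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

class FullWidthSpaceDemo {

    // Hypothetical helper (an assumption, not ManifoldCF code): percent-encode
    // every Unicode space character as its UTF-8 bytes and leave everything
    // else, including existing %XX escapes, untouched.
    public static String encodeSpaceChars(String raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < raw.length(); ) {
            int cp = raw.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isSpaceChar(cp)) {
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    sb.append(String.format("%%%02X", b & 0xFF));
                }
            } else {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    // Returns true if java.net.URI accepts the string as-is.
    public static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The path from the issue report, with the full-width space (U+3000)
        // written as a Unicode escape.
        String raw = "/site1/Shared%20Documents/test/aaa\u3000bbb.txt";
        System.out.println("raw parses:     " + parses(raw));

        String encoded = encodeSpaceChars(raw);
        System.out.println("encoded form:   " + encoded);
        System.out.println("encoded parses: " + parses(encoded));
    }
}
```

The sketch only handles space characters. Any real change would need to be
checked against the relevant standard or against observed browser behavior
before going into WebURL.
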
> Support for full-width space against url
> ----------------------------------------
>
> Key: CONNECTORS-477
> URL: https://issues.apache.org/jira/browse/CONNECTORS-477
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Web connector
> Reporter: Shinichiro Abe
> Assignee: Shinichiro Abe
> Priority: Minor
> Fix For: ManifoldCF 0.6
>
> Attachments: CONNECTORS-477.patch
>
>
> When a URL includes a full-width space (U+3000, "　"), MCF can't ingest the document.
> e.g.
> 1. file name
> http://server/site1/Shared%20Documents/test/aaa bbb.txt
> 2. path
> http://localhost/aaa bbb/aaa.txt
> MCF's log says:
> {noformat}
> WEB: Can't use url '/site1/Shared%20Documents/test/aaa bbb.txt' because it is
> badly formed: Illegal character in path at index 34:
> /site1/Shared%20Documents/test/aaa bbb.txt
> {noformat}