[
https://issues.apache.org/jira/browse/HTTPCLIENT-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846602#comment-16846602
]
Oleg Kalnichevski commented on HTTPCLIENT-1990:
-----------------------------------------------
[~ncw]
Unescaped non-ASCII characters are illegal anywhere in URI components. Please
see RFC 2396, appendix A, "Collected BNF for URI", and section 2.1 on the
correct representation of non-ASCII characters in URIs.
{noformat}
2.1 URI and non-ASCII characters

   The relationship between URI and characters has been a source of
   confusion for characters that are not part of US-ASCII. To describe
   the relationship, it is useful to distinguish between a "character"
   (as a distinguishable semantic entity) and an "octet" (an 8-bit
   byte). There are two mappings, one from URI characters to octets, and
   a second from octets to original characters:

   URI character sequence->octet sequence->original character sequence
{noformat}
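To make the character-to-octet mapping concrete, here is a minimal sketch of percent-encoding a path. It assumes UTF-8 as the character encoding (RFC 2396 itself does not fix one), and the class and method names are hypothetical, not part of HttpClient or the JDK.

```java
import java.nio.charset.StandardCharsets;

public class PercentEncodeSketch {

    // RFC 2396 "unreserved" = alphanum | mark, where mark = - _ . ! ~ * ' ( )
    public static boolean isUnreserved(int c) {
        return (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || "-_.!~*'()".indexOf(c) >= 0;
    }

    // Hypothetical helper: characters -> UTF-8 octets -> URI characters.
    // Every octet that is not unreserved (and not the '/' separator) is
    // written as %XX, so the result contains only US-ASCII.
    public static String encodePath(String path) {
        StringBuilder out = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            if (isUnreserved(octet) || octet == '/') {
                out.append((char) octet);
            } else {
                out.append('%').append(String.format("%02X", octet));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00FC is 'ü' and \u00F1 is 'ñ'; escapes keep the source file ASCII-only.
        System.out.println(encodePath("/\u00FC\u00F1")); // prints /%C3%BC%C3%B1
    }
}
```

Each of the two characters becomes two octets in UTF-8, which is why the encoded path carries four escape triplets.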
As far as I am concerned the current behavior of HttpClient conforms to the
specification as defined by RFC 2396 and Oracle's URI implementation does not.
{code:java}
@Test
public void testStuff() throws Exception {
    // URI built with the JDK's multi-argument constructor
    URI uri1 = new URI("http", "somehost", "/üñîçøðé", null);
    System.out.printf("rawPath = %s\n", uri1.getRawPath());
    System.out.printf("path = %s\n", uri1.getPath());
    // The same URI built with HttpClient's URIBuilder
    URI uri2 = new URIBuilder().setScheme("http").setHost("somehost").setPath("üñîçøðé").build();
    System.out.printf("rawPath = %s\n", uri2.getRawPath());
    System.out.printf("path = %s\n", uri2.getPath());
    // The URIBuilder result after rewriting with normalization
    URI uri3 = URIUtils.rewriteURI(uri2, null, URIUtils.DROP_FRAGMENT_AND_NORMALIZE);
    System.out.printf("rawPath = %s\n", uri3.getRawPath());
    System.out.printf("path = %s\n", uri3.getPath());
}
{code}
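For readers without HttpClient on the classpath, the JDK half of the comparison can be sketched on its own. This uses only java.net.URI; the class name and the single-character path are illustrative assumptions. The multi-argument constructor leaves non-ASCII characters unescaped in the raw path, while toASCIIString() produces the percent-encoded UTF-8 form.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriRawPathSketch {

    // Builds a URI whose path contains 'ü' (written as \u00FC to keep the source ASCII-only).
    public static URI build() throws URISyntaxException {
        return new URI("http", "somehost", "/\u00FC", null);
    }

    public static void main(String[] args) throws URISyntaxException {
        URI uri = build();
        // The raw path still holds the literal non-ASCII character.
        System.out.printf("rawPath = %s%n", uri.getRawPath());
        // The decoded path is the same string in this case.
        System.out.printf("path = %s%n", uri.getPath());
        // toASCIIString() escapes non-ASCII characters as UTF-8 octets.
        System.out.printf("ascii = %s%n", uri.toASCIIString());
    }
}
```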
Oleg
> URIUtils.rewriteURI mangles unicode characters
> ----------------------------------------------
>
> Key: HTTPCLIENT-1990
> URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1990
> Project: HttpComponents HttpClient
> Issue Type: Bug
> Components: HttpCache
> Affects Versions: 4.5.8
> Reporter: Nicholas Wilson
> Priority: Minor
>
> The following test case illustrates a problem with URIUtils that I have
> encountered:
> {code:java}
> public class Main {
>     public static void main(String[] args) throws Exception {
>         URI uri = UriComponentsBuilder.fromUriString("https://host/path")
>                 .pathSegment("üñîçøðé")
>                 .build()
>                 .toUri();
>         System.out.printf("rawPath = %s\n", uri.getRawPath());
>         System.out.printf("path = %s\n", uri.getPath());
>         uri = URIUtils.rewriteURI(uri, null, URIUtils.DROP_FRAGMENT_AND_NORMALIZE);
>         System.out.printf("rawPath = %s\n", uri.getRawPath());
>         System.out.printf("path = %s\n", uri.getPath());
>     }
> }
> {code}
> The issue was encountered because previous versions of httpclient did not
> perform path normalisation (the main caller is ProtocolExec in the HTTP
> client) and effectively only did URIUtils.DROP_FRAGMENT, so users who
> upgrade get the new normalisation feature unexpectedly.
> The bug exhibited by URIUtils.rewriteURI is actually caused by
> URLEncodedUtils.urlDecode (reached from URIBuilder's constructor, which calls
> URIBuilder.parsePath), which does something truly nasty: it takes a String (a
> logical sequence of Unicode code points), converts it to a CharBuffer, then
> iterates over it, narrowing each char to a byte. Strange, but true.
> As far as I can tell, Unicode characters in a java.net.URI are legal. They
> should simply be escaped as percent-encoded UTF-8 bytes in the value returned
> by URI.getRawPath, but they are returned unescaped by URI.getPath, which is
> what URIUtils.rewriteURI uses.
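
The failure mode the reporter describes, narrowing each char to a byte, can be illustrated in isolation. This is only a sketch of that general pattern under the reporter's description, not HttpClient's actual code; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class CharNarrowingSketch {

    // Narrows each UTF-16 char to a single byte, discarding the high 8 bits.
    // This is lossy for any character outside US-ASCII.
    public static byte[] narrowChars(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    // Formats bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00FC"; // 'ü', written as an escape to keep the source ASCII-only
        byte[] narrowed = narrowChars(s);
        System.out.println(hex(narrowed));                              // prints FC
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // prints C3 BC
        // A lone 0xFC byte is not valid UTF-8, so decoding yields U+FFFD.
        String decoded = new String(narrowed, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("\uFFFD"));                   // prints true
    }
}
```

The narrowed byte differs from the UTF-8 encoding of the same character, so decoding it as UTF-8 cannot recover the original text.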