[ 
https://issues.apache.org/jira/browse/HTTPCLIENT-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846602#comment-16846602
 ] 

Oleg Kalnichevski commented on HTTPCLIENT-1990:
-----------------------------------------------

[~ncw]

Unescaped non-ASCII characters are illegal anywhere in URI components. Please 
see RFC 2396, appendix A, "Collected BNF for URI".

Please see section 2.1 on the correct representation of non-ASCII characters 
in URIs.
{noformat}
2.1 URI and non-ASCII characters
The relationship between URI and characters has been a source of
confusion for characters that are not part of US-ASCII. To describe
the relationship, it is useful to distinguish between a "character"
(as a distinguishable semantic entity) and an "octet" (an 8-bit
byte). There are two mappings, one from URI characters to octets, and
a second from octets to original characters:

URI character sequence->octet sequence->original character sequence
{noformat}
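
The two mappings quoted above can be sketched in plain Java. This is only an illustration, assuming UTF-8 as the character-to-octet encoding and using the RFC 2396 "unreserved" set plus "/" as the characters left unescaped; the class and method names are hypothetical and are not part of HttpClient.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class UriOctetMapping {

    // Original characters -> UTF-8 octets -> URI characters (percent-escaped).
    // Assumption: UTF-8 is the agreed encoding; RFC 2396 itself does not fix one.
    public static String escape(String original) {
        StringBuilder uriChars = new StringBuilder();
        for (byte b : original.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            // RFC 2396 "unreserved" = alphanum | mark
            boolean unreserved = (octet >= 'a' && octet <= 'z')
                    || (octet >= 'A' && octet <= 'Z')
                    || (octet >= '0' && octet <= '9')
                    || "-_.!~*'()".indexOf(octet) >= 0;
            if (unreserved || octet == '/') {
                uriChars.append((char) octet);
            } else {
                uriChars.append(String.format("%%%02X", octet));
            }
        }
        return uriChars.toString();
    }

    // URI characters -> octets -> original characters.
    // Assumes well-formed input: ASCII only, every '%' followed by two hex digits.
    public static String unescape(String uriChars) {
        ByteArrayOutputStream octets = new ByteArrayOutputStream();
        for (int i = 0; i < uriChars.length(); i++) {
            char c = uriChars.charAt(i);
            if (c == '%') {
                octets.write(Integer.parseInt(uriChars.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                octets.write(c);
            }
        }
        return new String(octets.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String escaped = escape("/\u00FC\u00F1");   // "/üñ"
        System.out.println(escaped);                // /%C3%BC%C3%B1
        System.out.println(unescape(escaped).equals("/\u00FC\u00F1"));
    }
}
```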
 
As far as I am concerned the current behavior of HttpClient conforms to the 
specification as defined by RFC 2396 and Oracle's URI implementation does not.
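
The difference in question can be observed with the JDK alone. The sketch below uses only java.net.URI; the class and helper names are hypothetical. The multi-argument constructor keeps non-ASCII characters (the "other" category in the java.net.URI documentation) unescaped in the raw path, while toASCIIString() produces the percent-encoded UTF-8 form.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class JdkUriNonAscii {

    // Raw path as stored by java.net.URI when built from components.
    public static String rawPath(String path) throws URISyntaxException {
        return new URI("http", "somehost", path, null).getRawPath();
    }

    // US-ASCII form: non-ASCII characters are percent-encoded as UTF-8 octets.
    public static String asciiForm(String path) throws URISyntaxException {
        return new URI("http", "somehost", path, null).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        String path = "/\u00FC";                 // "/ü"
        System.out.println(rawPath(path));       // non-ASCII character left unescaped
        System.out.println(asciiForm(path));     // http://somehost/%C3%BC
    }
}
```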

{code:java}
@Test
public void testStuff() throws Exception {
    URI uri1 = new URI("http", "somehost", "/üñîçøðé", null);
    System.out.printf("rawPath = %s\n", uri1.getRawPath());
    System.out.printf("path    = %s\n", uri1.getPath());

    URI uri2 = new URIBuilder().setScheme("http").setHost("somehost")
            .setPath("üñîçøðé").build();
    System.out.printf("rawPath = %s\n", uri2.getRawPath());
    System.out.printf("path    = %s\n", uri2.getPath());

    URI uri3 = URIUtils.rewriteURI(uri2, null,
            URIUtils.DROP_FRAGMENT_AND_NORMALIZE);
    System.out.printf("rawPath = %s\n", uri3.getRawPath());
    System.out.printf("path    = %s\n", uri3.getPath());
}
{code}

Oleg


> URIUtils.rewriteURI mangles unicode characters
> ----------------------------------------------
>
>                 Key: HTTPCLIENT-1990
>                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1990
>             Project: HttpComponents HttpClient
>          Issue Type: Bug
>          Components: HttpCache
>    Affects Versions: 4.5.8
>            Reporter: Nicholas Wilson
>            Priority: Minor
>
> The following test case illustrates a problem with URIUtils that I have 
> encountered:
> {code:java}
> public class Main {
>   public static void main(String[] args) throws Exception {
>     URI uri = UriComponentsBuilder.fromUriString("https://host/path")
>       .pathSegment("üñîçøðé")
>       .build()
>       .toUri();
>     System.out.printf("rawPath = %s\n", uri.getRawPath());
>     System.out.printf("path    = %s\n", uri.getPath());
>     uri = URIUtils.rewriteURI(uri, null,
>       URIUtils.DROP_FRAGMENT_AND_NORMALIZE);
>     System.out.printf("rawPath = %s\n", uri.getRawPath());
>     System.out.printf("path    = %s\n", uri.getPath());
>   }
> }
> {code}
> The issue was encountered because previous versions of httpclient didn't 
> perform the path normalisation (the main caller is ProtocolExec in the HTTP 
> client) and effectively only did URIUtils.DROP_FRAGMENT, so users who 
> upgrade will get the new normalisation feature unexpectedly.
> The bug exhibited by URIUtils.rewriteURI is actually caused by 
> URLEncodedUtils.urlDecode (inside URIBuilder's ctor, which calls 
> URIBuilder.parsePath), which does something truly nasty. It takes a String (a 
> logical sequence of Unicode code points), casts it to a CharBuffer, then 
> iterates over it, slicing the chars to bytes! Strange, but true.
> Unicode characters in a java.net.URI are legal, as far as I can tell, and 
> should be simply escaped as percent-encoded UTF-8 bytes as returned by 
> URI.getRawPath - but! - not when returned unescaped by URI.getPath, which is 
> what URIUtils.rewriteURI uses.
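
The failure mode described in the report, narrowing each char to a single byte before decoding, can be reproduced in isolation. The sketch below is a simplified illustration and not the HttpClient source; the class and method names are hypothetical. It contrasts the narrowing approach with one that encodes unescaped characters as UTF-8 octets before decoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class CharToByteTruncation {

    // Narrowing approach: every char is cast to one byte, so any character
    // above U+007F turns into an octet that is not valid UTF-8 on its own.
    public static String naiveDecode(String s) {
        byte[] bytes = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            bytes[i] = (byte) s.charAt(i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Tolerant approach: "%XX" escapes become raw octets, and any other
    // code point contributes its full UTF-8 encoding.
    public static String decode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            char c = s.charAt(i);
            if (c == '%' && i + 3 <= s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                int cp = s.codePointAt(i);
                byte[] utf8 = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                out.write(utf8, 0, utf8.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String unescaped = "/\u00FC";   // "/ü", as returned by URI.getPath
        System.out.println(naiveDecode(unescaped).equals(unescaped)); // false
        System.out.println(decode(unescaped).equals(unescaped));      // true
        System.out.println(decode("/%C3%BC").equals(unescaped));      // true
    }
}
```

Both the escaped and the unescaped form of the path decode to the same string with the second method, which is the behaviour the report expects.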



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
