[jira] [Created] (HTTPCLIENT-1990) URIUtils.rewriteURI mangles unicode characters

Nicholas Wilson (JIRA) Wed, 22 May 2019 11:24:26 -0700

Nicholas Wilson created HTTPCLIENT-1990:
-------------------------------------------


             Summary: URIUtils.rewriteURI mangles unicode characters
                 Key: HTTPCLIENT-1990
                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1990
             Project: HttpComponents HttpClient
          Issue Type: Bug
          Components: HttpCache
    Affects Versions: 4.5.8
            Reporter: Nicholas Wilson


The following test case illustrates a problem with URIUtils that I have 
encountered:
{code:java}
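import java.net.URI;
import org.apache.http.client.utils.URIUtils;
// UriComponentsBuilder is the Spring Web utility class.
import org.springframework.web.util.UriComponentsBuilder;

public class Main {
  public static void main(String[] args) throws Exception {
    // Build a URI whose last path segment contains non-ASCII characters.
    URI uri = UriComponentsBuilder.fromUriString("https://host/path")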
      .pathSegment("üñîçøðé")
      .build()
      .toUri();
    System.out.printf("rawPath = %s\n", uri.getRawPath());
    System.out.printf("path    = %s\n", uri.getPath());

    uri = URIUtils.rewriteURI(uri, null, URIUtils.DROP_FRAGMENT_AND_NORMALIZE);
    System.out.printf("rawPath = %s\n", uri.getRawPath());
    System.out.printf("path    = %s\n", uri.getPath());
  }
}
{code}
I encountered the issue because previous versions of httpclient did not
perform path normalisation. The main caller, ProtocolExec in the HTTP client,
effectively only applied URIUtils.DROP_FRAGMENT, so users who upgrade get the
new normalisation behaviour unexpectedly.

The bug exhibited by URIUtils.rewriteURI is actually caused by
URLEncodedUtils.urlDecode (reached from URIBuilder's constructor, which calls
URIBuilder.parsePath), which does something truly nasty. It takes a String (a
logical sequence of Unicode code points), wraps it in a CharBuffer, then
iterates over it, narrowing each char to a single byte instead of encoding it
as UTF-8. Non-ASCII characters do not survive that step. Strange, but true.
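
The effect of narrowing chars to bytes can be reproduced without httpclient.
The following is a minimal sketch, not the URLEncodedUtils source: the class
NarrowingDemo and both helper methods are hypothetical names, and UTF-8 is
assumed as the decoding charset. The test string is the same seven characters
used in the test case above, written as Unicode escapes so the file compiles
under any source encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class NarrowingDemo {
    // Hypothetical helper: copies each char into a byte buffer by narrowing
    // it to its low 8 bits, then decodes the buffer as UTF-8.
    public static String narrowAndDecode(String s) {
        ByteBuffer bb = ByteBuffer.allocate(s.length());
        for (int i = 0; i < s.length(); i++) {
            bb.put((byte) s.charAt(i)); // lossy for anything outside ASCII
        }
        bb.flip();
        return StandardCharsets.UTF_8.decode(bb).toString();
    }

    // Hypothetical helper: encodes the whole string as UTF-8 first, so
    // decoding with the same charset round-trips.
    public static String encodeAndDecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = "\u00fc\u00f1\u00ee\u00e7\u00f8\u00f0\u00e9";
        System.out.println(narrowAndDecode(input).equals(input)); // false
        System.out.println(encodeAndDecode(input).equals(input)); // true
        System.out.println(narrowAndDecode("abc").equals("abc")); // true
    }
}
```

Pure ASCII input survives the narrowing, which may be why the problem only
shows up once a path contains characters outside that range.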

Unicode characters in a java.net.URI are legal, as far as I can tell. They
should simply be escaped as percent-encoded UTF-8 bytes, which is the form
returned by URI.getRawPath, but they are not escaped in the decoded form
returned by URI.getPath, which is what URIUtils.rewriteURI uses.
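
The difference between the two accessors can be seen with java.net.URI alone.
This is a small sketch under the assumption that the URI string is already
percent-encoded; the class RawPathDemo and its helpers are hypothetical names
introduced for illustration.

```java
import java.net.URI;

class RawPathDemo {
    // Returns the path exactly as it appears in the URI string,
    // with percent-escapes left intact.
    public static String rawPath(String uri) {
        return URI.create(uri).getRawPath();
    }

    // Returns the path with percent-escapes decoded as UTF-8, so it may
    // contain non-ASCII characters.
    public static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        // %C3%BC and %C3%B1 are the UTF-8 encodings of U+00FC and U+00F1.
        String s = "https://host/path/%C3%BC%C3%B1";
        System.out.println(rawPath(s)); // /path/%C3%BC%C3%B1
        // The decoded path is shorter because each escape pair collapses
        // into a single char.
        System.out.println(decodedPath(s).length() < rawPath(s).length());
    }
}
```

Code that consumes the decoded form has to re-encode non-ASCII characters as
UTF-8 before treating them as bytes, whereas the raw form is already ASCII.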



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
