Nicholas Wilson created HTTPCLIENT-1990:
-------------------------------------------
Summary: URIUtils.rewriteURI mangles Unicode characters
Key: HTTPCLIENT-1990
URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1990
Project: HttpComponents HttpClient
Issue Type: Bug
Components: HttpCache
Affects Versions: 4.5.8
Reporter: Nicholas Wilson
The following test case illustrates a problem with URIUtils that I have
encountered:
{code:java}
import java.net.URI;

import org.apache.http.client.utils.URIUtils;
import org.springframework.web.util.UriComponentsBuilder;

public class Main {
    public static void main(String[] args) throws Exception {
        // Build a URI whose path contains non-ASCII characters.
        URI uri = UriComponentsBuilder.fromUriString("https://host/path")
                .pathSegment("üñîçøðé")
                .build()
                .toUri();
        System.out.printf("rawPath = %s\n", uri.getRawPath());
        System.out.printf("path = %s\n", uri.getPath());
        // Rewrite the URI with path normalisation enabled.
        uri = URIUtils.rewriteURI(uri, null, URIUtils.DROP_FRAGMENT_AND_NORMALIZE);
        System.out.printf("rawPath = %s\n", uri.getRawPath());
        System.out.printf("path = %s\n", uri.getPath());
    }
}
{code}
I encountered the issue because previous versions of httpclient didn't perform
the path normalisation (the main caller is ProtocolExec in the HTTP client)
and effectively only did URIUtils.DROP_FRAGMENT, so users who upgrade get the
new normalisation behaviour unexpectedly.
The bug exhibited by URIUtils.rewriteURI is actually caused by
URLEncodedUtils.urlDecode (inside URIBuilder's ctor, which calls
URIBuilder.parsePath), which does something truly nasty. It takes a String (a
logical sequence of Unicode code points), wraps it in a CharBuffer, then
iterates over it, narrowing each char to a byte! Strange, but true.
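To illustrate why that pattern is lossy, here is a minimal sketch. It is a
simplified model of the behaviour described above, not the actual HttpClient
source; the class CharTruncationDemo and both method names are made up for
this example, and UTF-8 is assumed as the decoding charset.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: models the char-to-byte narrowing described in
// this report. It is NOT a copy of URLEncodedUtils.urlDecode.
final class CharTruncationDemo {

    // Narrows every UTF-16 char to one byte, then decodes the bytes as UTF-8.
    // Only ASCII input survives this unchanged.
    static String truncatingDecode(String content) {
        ByteBuffer bytes = ByteBuffer.allocate(content.length());
        CharBuffer chars = CharBuffer.wrap(content);
        while (chars.hasRemaining()) {
            bytes.put((byte) chars.get()); // high bits are dropped here
        }
        bytes.flip();
        return StandardCharsets.UTF_8.decode(bytes).toString();
    }

    // Encodes to UTF-8 bytes first, so decoding restores the original text.
    static String utf8RoundTrip(String content) {
        byte[] utf8 = content.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "/path/" followed by the path segment from the test case,
        // written with Unicode escapes.
        String path = "/path/\u00fc\u00f1\u00ee\u00e7\u00f8\u00f0\u00e9";
        System.out.println("truncating: " + truncatingDecode(path));
        System.out.println("round trip: " + utf8RoundTrip(path));
    }
}
```

The narrowed bytes are not a valid UTF-8 sequence for the original characters,
so the decoder cannot reproduce them, while the ASCII prefix passes through
untouched.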
Unicode characters in a java.net.URI are legal, as far as I can tell. They
should simply be escaped as percent-encoded UTF-8 bytes, as in the value
returned by URI.getRawPath, but they are not escaped in the value returned by
URI.getPath, which is what URIUtils.rewriteURI uses.
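The difference between the two accessors can be seen with java.net.URI alone.
The sketch below is only an illustration: the class RawPathDemo and its method
names are hypothetical, and it uses URI.toASCIIString as one way to obtain the
percent-encoded UTF-8 form.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class contrasting URI.getRawPath and URI.getPath.
final class RawPathDemo {

    // Builds a URI from an unescaped path and returns its path in
    // percent-encoded form. toASCIIString escapes non-ASCII characters
    // as UTF-8 octets.
    static String escapedPath(String path) throws URISyntaxException {
        URI uri = new URI("https", "host", path, null);
        return URI.create(uri.toASCIIString()).getRawPath();
    }

    // Parses an already-escaped URI and returns the decoded path.
    // getPath decodes escaped octets as UTF-8.
    static String decodedPath(String escapedUri) {
        return URI.create(escapedUri).getPath();
    }

    public static void main(String[] args) throws URISyntaxException {
        String escaped = escapedPath("/path/\u00e9");
        System.out.println("rawPath = " + escaped);
        System.out.println("path    = " + decodedPath("https://host" + escaped));
    }
}
```

Code that feeds the decoded getPath value into a routine expecting escaped
input has to re-encode it as UTF-8 first, or the non-ASCII characters are at
risk.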