[
https://issues.apache.org/jira/browse/HTTPCLIENT-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154177#comment-17154177
]
Mark Mielke commented on HTTPCLIENT-1995:
-----------------------------------------
This is what I found when comparing java.net.URI.normalize() with HttpClient's URIUtils.normalizeSyntax():
{noformat}
ORIGINAL: http://foo.com/%26path?name=value&name=value
java.net.URI normalize: http://foo.com/%26path?name=value&name=value
Apache HttpClient normalize: http://foo.com/&path?name=value&name=value
ORIGINAL: http://foo.com/%3F%26path?name=value&name=value
java.net.URI normalize: http://foo.com/%3F%26path?name=value&name=value
Apache HttpClient normalize: http://foo.com/%3F&path?name=value&name=value
ORIGINAL: http://foo.com/./foo/bar/.././grr%26path
java.net.URI normalize: http://foo.com/foo/grr%26path
Apache HttpClient normalize: http://foo.com/foo/grr&path
{noformat}
You can see that Java's URI.normalize() (OpenJDK 8 in this case) does not
normalize at the character level. It removes the "." and ".." segments, but
leaves percent-encoded characters, reserved or unreserved, exactly as written.
You can see that Apache HttpClient's URIUtils.normalizeSyntax() takes things a
step further: it not only removes the "." and ".." segments, but also
normalizes the path segments by fully decoding and then fully re-encoding them,
using URIBuilder, which uses URI.
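The difference between the two behaviours can be reproduced with the JDK alone. The sketch below is not HttpClient's code: decodeAndReencode is a hypothetical stand-in that approximates a "fully decode, then re-encode" step with java.net.URI's multi-argument constructor, and the class and method names are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class NormalizeComparison {

    // Dot-segment removal only: java.net.URI.normalize() leaves
    // percent-encoded octets such as %26 untouched.
    static String jdkNormalize(String s) {
        return URI.create(s).normalize().toString();
    }

    // Hypothetical stand-in for "decode, then re-encode" (an assumption, not
    // HttpClient's implementation). getPath() returns the decoded path, and
    // the multi-argument constructor quotes only characters that are illegal
    // in a path, so a decoded '&' is never re-encoded.
    static String decodeAndReencode(String s) {
        URI u = URI.create(s).normalize();
        try {
            return new URI(u.getScheme(), u.getAuthority(), u.getPath(),
                    u.getQuery(), u.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String original = "http://foo.com/./foo/bar/.././grr%26path";
        System.out.println("jdk normalize:     " + jdkNormalize(original));
        System.out.println("decode + reencode: " + decodeAndReencode(original));
    }
}
```

Once the path has been decoded, the information that the '&' was originally percent-encoded is gone, so no re-encoding step can restore it.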
The code that I quoted above states explicitly that it removes "." segments,
and it also references RFC 3986 as its canon:
{code}
/**
* Removes dot segments according to RFC 3986, section 5.2.4 and
* Syntax-Based Normalization according to RFC 3986, section 6.2.2.
*
* @param uri the original URI
* @return the URI without dot segments
*
* @since 4.5
*/
public static URI normalizeSyntax(final URI uri) throws URISyntaxException {
{code}
I understand Oleg is using RFC 2396 as an initial reference, which discusses
"Normalization" in only a single place:
{noformat}
6. URI Normalization and Equivalence
In many cases, different URI strings may actually identify the
identical resource. For example, the host names used in URL are
actually case insensitive, and the URL <http://www.XEROX.com> is
equivalent to <http://www.xerox.com>. In general, the rules for
equivalence and definition of a normal form, if any, are scheme
dependent. When a scheme uses elements of the common syntax, it will
also use the common syntax equivalence rules, namely that the scheme
and hostname are case insensitive and a URL with an explicit ":port",
where the port is the default for the scheme, is equivalent to one
where the port is elided.
{noformat}
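As a rough illustration, the common-syntax equivalence rules quoted above (case-insensitive scheme and host, default port equivalent to an elided port) could be sketched as follows. The class name, the method names, and the small default-port table are assumptions made for this example; everything else in the URI is compared as written, without any decoding.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Objects;

public class CommonSyntaxEquivalence {

    // Assumed default ports, for illustration only (http and https).
    static int defaultPort(String scheme) {
        if ("http".equals(scheme)) return 80;
        if ("https".equals(scheme)) return 443;
        return -1;
    }

    static String lower(String s) {
        return s == null ? null : s.toLowerCase(Locale.ROOT);
    }

    // An elided port is treated as the scheme's default port.
    static int effectivePort(URI u) {
        return u.getPort() == -1 ? defaultPort(lower(u.getScheme())) : u.getPort();
    }

    // Scheme and host compare case-insensitively; path and query are
    // compared in raw form, so %26 and & are NOT treated as equal.
    static boolean equivalent(URI a, URI b) {
        return Objects.equals(lower(a.getScheme()), lower(b.getScheme()))
                && Objects.equals(lower(a.getHost()), lower(b.getHost()))
                && effectivePort(a) == effectivePort(b)
                && Objects.equals(a.getRawPath(), b.getRawPath())
                && Objects.equals(a.getRawQuery(), b.getRawQuery());
    }

    public static void main(String[] args) {
        System.out.println(equivalent(URI.create("http://www.XEROX.com"),
                URI.create("http://www.xerox.com:80")));
        System.out.println(equivalent(URI.create("http://foo.com/%26path"),
                URI.create("http://foo.com/&path")));
    }
}
```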
This seems to permit a broad interpretation of what normalization may be
performed. Under that reading, I can see how a very specific view of the
reserved characters, combined with a contextual judgement about which
components need %-encoding to preserve semantic differences (a subjective
judgement that may well be correct in many cases) and which do not, allows
"normalization" to take the form of "fully decode the path segments, and then
fully re-encode the path segments".
However, the code specifically references RFC 3986, *as it should*, given that
the prior RFCs are now obsolete and should no longer be used as a reference,
and RFC 3986 makes very particular statements about normalization.
Although you could argue about whether an application "should" attach semantic
differences to a character such as `&` in a path segment, it is likely that
some applications do, and the RFC is designed to ensure inter-operation among
both the applications we agree with and the applications we do not agree with,
to the best degree possible.
RFC 3986 is very clear:
{noformat}
The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI.
URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent. Percent-
encoding a reserved character, or decoding a percent-encoded octet
that corresponds to a reserved character, will change how the URI is
interpreted by most applications. Thus, characters in the reserved
set are protected from normalization and are therefore safe to be
used by scheme-specific and producer-specific algorithms for
delimiting data subcomponents within a URI.
{noformat}
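A normalization step that respects this rule would decode only percent-encoded unreserved characters (RFC 3986, section 6.2.2.2) and, for every other escape, at most uppercase the hex digits (section 6.2.2.1). The following is a minimal sketch of that idea, not HttpClient's implementation; the class and method names are hypothetical.

```java
public class UnreservedOnlyNormalizer {

    // RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Returns the value of an ASCII hex digit, or -1 if c is not one.
    static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Decodes %XX only when the octet is an unreserved character; all other
    // escapes (for example %26 or %2F) are kept, with uppercase hex digits.
    static String normalizePercentEncoding(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        int i = 0;
        while (i < raw.length()) {
            char ch = raw.charAt(i);
            if (ch == '%' && i + 2 < raw.length()) {
                int hi = hex(raw.charAt(i + 1));
                int lo = hex(raw.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    int octet = (hi << 4) | lo;
                    if (isUnreserved(octet)) {
                        out.append((char) octet);
                    } else {
                        out.append('%')
                           .append(Character.toUpperCase(raw.charAt(i + 1)))
                           .append(Character.toUpperCase(raw.charAt(i + 2)));
                    }
                    i += 3;
                    continue;
                }
            }
            // Malformed or incomplete escapes and ordinary characters pass through.
            out.append(ch);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizePercentEncoding("/%7Euser/grr%26path"));
    }
}
```

With this approach "%7E" becomes "~" while "%26" stays encoded, so the reserved set remains protected from normalization as the RFC text describes.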
Possibly this is a change from the prior RFC (although it seems more like a
clarification to me, given that the prior RFC didn't really specify
normalization rules).
Getting back to a point above - java.net.URI.normalize() does not decode and
re-encode the path segments, so java.net.URI - whether it implements an older
RFC or not - is not really to blame here. The choice to decode and re-encode
the path segments, above and beyond the original requirement to normalize "."
and ".." path segments, is an Apache HttpClient specific choice, that was
introduced in 4.5.7.
I appreciate Oleg's position better now. However, I think it gives too little
weight to real-world impact. Inter-operation is not clean, which is why the
robustness principle came to be seen as a necessity.
> Percent-encoded ampersand in URI path not preserved
> ---------------------------------------------------
>
> Key: HTTPCLIENT-1995
> URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1995
> Project: HttpComponents HttpClient
> Issue Type: Bug
> Components: HttpClient (classic)
> Affects Versions: 4.5.8, 4.5.9
> Environment: Linux Mint 19, OpenJDK 8
> Reporter: none_
> Priority: Major
>
> Starting with HttpClient 4.5.8, percent-encoded ampersand characters in URI
> path segments are not preserved any longer but written in decoded form to
> wire due to path normalization performed by URIUtils.rewriteURI(URI,
> HttpHost).
>
> According to RFC 3986 (page 11+), the ampersand character is a delimiter and
> thus needs to be percent-encoded when not used for this purpose. Path
> normalization, as performed by HttpClient v4.5.8+, creates a new URI that is
> not equivalent to the original URI and thus leads to misinterpretation on
> server/receiver side.
> ??URIs that differ in the replacement of a reserved character with its??
> ??corresponding percent-encoded octet are not equivalent. Percent-??
> ??encoding a reserved character, or decoding a percent-encoded octet??
> ??that corresponds to a reserved character, will change how the URI is??
> ??interpreted by most applications??.
>
> A very simple test case is as follows:
> {code:java}
> @Test
> public void testAmpersand() throws Throwable
> {
> final URI uri = new URI("http://example.org/some/path%26with%20percent/encoded/segments");
> final URI uri2 = URIUtils.rewriteURI(uri, null);
>
> Assert.assertEquals(uri, uri2);
> }
> {code}
>
>