[ 
https://issues.apache.org/jira/browse/HTTPCLIENT-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154177#comment-17154177
 ] 

Mark Mielke commented on HTTPCLIENT-1995:
-----------------------------------------

This is what I found when comparing java.net.URI.normalize() with HttpClient's 
URIUtils.normalizeSyntax():

{noformat}
ORIGINAL:                    http://foo.com/%26path?name=value&name=value
java.net.URI normalize:      http://foo.com/%26path?name=value&name=value
Apache HttpClient normalize: http://foo.com/&path?name=value&name=value

ORIGINAL:                    http://foo.com/%3F%26path?name=value&name=value
java.net.URI normalize:      http://foo.com/%3F%26path?name=value&name=value
Apache HttpClient normalize: http://foo.com/%3F&path?name=value&name=value

ORIGINAL:                    http://foo.com/./foo/bar/.././grr%26path
java.net.URI normalize:      http://foo.com/foo/grr%26path
Apache HttpClient normalize: http://foo.com/foo/grr&path
{noformat}

You can see that Java (OpenJDK 8 in this case) URI.normalize() does not 
normalize at the character level. It removes the "." and ".." segments, but 
preserves the reserved and unreserved characters.
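The java.net.URI behaviour described above can be reproduced with the JDK 
alone. This is a minimal sketch; the class and method names 
(JdkNormalizeDemo, jdkNormalize) are made up for illustration, and only 
java.net.URI is used:

```java
import java.net.URI;

public class JdkNormalizeDemo {

    // Applies java.net.URI.normalize() and returns the result as a string.
    // normalize() only touches "." and ".." path segments; it does not
    // decode or re-encode percent-encoded octets.
    public static String jdkNormalize(String s) {
        return URI.create(s).normalize().toString();
    }

    public static void main(String[] args) {
        // The three inputs from the comparison above.
        System.out.println(jdkNormalize("http://foo.com/%26path?name=value&name=value"));
        System.out.println(jdkNormalize("http://foo.com/%3F%26path?name=value&name=value"));
        System.out.println(jdkNormalize("http://foo.com/./foo/bar/.././grr%26path"));
    }
}
```

In each case the %26 should survive unchanged, which matches the 
"java.net.URI normalize" rows in the table.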

You can see that Apache HttpClient URIUtils.normalizeSyntax() takes things a 
step further, by not only removing the "." and ".." segments, but also by 
normalizing the path segments by fully decoding them and fully encoding them, 
using URIBuilder, which uses URI.
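To illustrate why a decode-then-re-encode round trip loses the %26, here is 
a rough sketch that uses only java.net.URI. It is not HttpClient's actual 
code path (that goes through URIBuilder); the class and method names are 
hypothetical. It assumes, as I believe is the case, that the multi-argument 
URI constructor only quotes characters that are illegal in the given 
component, and "&" is legal in a path:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DecodeReencodeDemo {

    // Rebuilds a URI from its DECODED components. getPath() and getQuery()
    // return decoded strings, and the five-argument constructor re-encodes
    // only characters that are not legal in that component.
    public static String decodeThenReencode(String s) throws URISyntaxException {
        URI original = new URI(s);
        URI rebuilt = new URI(
                original.getScheme(),
                original.getAuthority(),
                original.getPath(),     // "%26" has already become "&" here
                original.getQuery(),
                original.getFragment());
        return rebuilt.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(decodeThenReencode("http://foo.com/%26path"));
        System.out.println(decodeThenReencode("http://foo.com/%3F%26path"));
    }
}
```

The "?" has to be re-encoded because it would otherwise start the query, but 
the "&" is written back unencoded, which is the same pattern as the "Apache 
HttpClient normalize" rows above.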

The Javadoc of the code that I quoted above states explicitly that it removes 
"." segments, and it also references RFC 3986 as its canon:

{code}
    /**
     * Removes dot segments according to RFC 3986, section 5.2.4 and
     * Syntax-Based Normalization according to RFC 3986, section 6.2.2.
     *
     * @param uri the original URI
     * @return the URI without dot segments
     *
     * @since 4.5
     */
    public static URI normalizeSyntax(final URI uri) throws URISyntaxException {
{code}

I understand Oleg is using RFC 2396 as an initial reference, which mentions 
"Normalization" only once:

{noformat}
6. URI Normalization and Equivalence

   In many cases, different URI strings may actually identify the
   identical resource. For example, the host names used in URL are
   actually case insensitive, and the URL <http://www.XEROX.com> is
   equivalent to <http://www.xerox.com>. In general, the rules for
   equivalence and definition of a normal form, if any, are scheme
   dependent. When a scheme uses elements of the common syntax, it will
   also use the common syntax equivalence rules, namely that the scheme
   and hostname are case insensitive and a URL with an explicit ":port",
   where the port is the default for the scheme, is equivalent to one
   where the port is elided.
{noformat}

This seems to permit a broad interpretation of what normalization may be 
performed. Under that reading, I can see how a very specific interpretation of 
the reserved characters, combined with a contextual (and subjective, though 
often correct) judgment about which components need %-encoding to preserve 
semantic differences and which do not, would allow "normalization" to take the 
form of "fully decode the path segments, and then fully re-encode the path 
segments".

However, the code specifically references RFC 3986, *as it should*, given that 
the prior RFCs are now obsolete and should no longer be used as a reference, 
and RFC 3986 makes very particular statements about normalization.

Although you could argue about whether an application "should" attach semantic 
meaning to a character such as `&` in a path segment, it is likely that some 
applications do, and the RFC is designed to ensure inter-operation among both 
the applications we agree with and the applications we do not agree with, to 
the best degree possible.

RFC 3986 is very clear:

{noformat}
   The purpose of reserved characters is to provide a set of delimiting
   characters that are distinguishable from other data within a URI.
   URIs that differ in the replacement of a reserved character with its
   corresponding percent-encoded octet are not equivalent.  Percent-
   encoding a reserved character, or decoding a percent-encoded octet
   that corresponds to a reserved character, will change how the URI is
   interpreted by most applications.  Thus, characters in the reserved
   set are protected from normalization and are therefore safe to be
   used by scheme-specific and producer-specific algorithms for
   delimiting data subcomponents within a URI.
{noformat}
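As a purely hypothetical illustration of the passage above, consider a 
server that treats "&" in the raw path as a field delimiter and percent-decodes 
each field afterwards. The class, the method and the splitting rule are all 
made up for this sketch; the point is only that "%26" and "&" can produce 
different results on the receiving side:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;

public class AmpersandDelimiterDemo {

    // Hypothetical server-side rule: a literal '&' in the raw path separates
    // fields; each field is percent-decoded only after splitting.
    public static List<String> splitFields(String rawPath) throws UnsupportedEncodingException {
        List<String> fields = new ArrayList<>();
        for (String part : rawPath.substring(1).split("&")) {
            fields.add(URLDecoder.decode(part, "UTF-8"));
        }
        return fields;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // As sent by the application: two fields, "tom&jerry" and "spike".
        System.out.println(splitFields("/tom%26jerry&spike"));
        // After the %26 has been decoded in transit: three fields.
        System.out.println(splitFields("/tom&jerry&spike"));
    }
}
```

A receiver like this sees two fields for the original path and three for the 
"normalized" one, so the two URIs are not equivalent for it.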

Possibly this is a change from the prior RFC (although it seems more like a 
clarification to me, given that the prior RFC didn't really specify 
normalization rules).

Getting back to a point above - java.net.URI.normalize() does not decode and 
re-encode the path segments, so java.net.URI - whether it implements an older 
RFC or not - is not really to blame here. The choice to decode and re-encode 
the path segments, above and beyond the original requirement to normalize "." 
and ".." path segments, is an Apache HttpClient-specific choice that was 
introduced in 4.5.7.

I better appreciate Oleg's position now. However, I think it shows too little 
concern for real-world impact. Inter-operation is not clean, which is why the 
robustness principle came to be seen as a necessity.

> Percent-encoded ampersand in URI path not preserved
> ---------------------------------------------------
>
>                 Key: HTTPCLIENT-1995
>                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1995
>             Project: HttpComponents HttpClient
>          Issue Type: Bug
>          Components: HttpClient (classic)
>    Affects Versions: 4.5.8, 4.5.9
>         Environment: Linux Mint 19, OpenJDK 8
>            Reporter: none_
>            Priority: Major
>
> Starting with HttpClient 4.5.8, percent-encoded ampersand characters in URI 
> path segments are not preserved any longer but written in decoded form to 
> wire due to path normalization performed by URIUtils.rewriteURI(URI, 
> HttpHost).
>  
> According to RFC 3986 (page 11+), the ampersand character is a delimiter and 
> thus needs to be percent-encoded when not used for this purpose. Path 
> normalization, as performed by HttpClient v4.5.8+, creates a new URI that is 
> not equivalent to the original URI and thus leads to misinterpretation on 
> server/receiver side.
> ??URIs that differ in the replacement of a reserved character with its??
> ??corresponding percent-encoded octet are not equivalent. Percent-??
> ??encoding a reserved character, or decoding a percent-encoded octet??
> ??that corresponds to a reserved character, will change how the URI is??
> ??interpreted by most applications??.
>   
> A very simple test case is as follows:
> {code:java}
> @Test
> public void testAmpersand() throws Throwable
> {
>     final URI uri = new URI("http://example.org/some/path%26with%20percent/encoded/segments");
>     final URI uri2 = URIUtils.rewriteURI(uri, null);
>         
>     Assert.assertEquals(uri, uri2);
> }
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)