Re: RFC3986 conformance saga

Mark Mielke Sun, 26 Jul 2020 14:18:12 -0700

On Sun, Jul 26, 2020 at 1:29 PM Oleg Kalnichevski <[email protected]> wrote:


> Please find a few minutes to review the changes I am proposing to make
> HttpCore partially RFC3986 conformant in the 5.1 branch:
>
> https://github.com/apache/httpcomponents-core/pull/205
>
> There are two major differences to the behavior of HttpCore 5.0.x and
> HttpClient 4.5.x:
>
> 1. percent-encoding is applied to all unreserved characters whenever or
> not some of those characters are explicitly permitted for use in URI
> components, for instance `&` character in path segments.
>

Did you mean "is applied to all *reserved* characters whenever or not some
of these characters are explicitly permitted for use in URI components"?

I am trying to understand the impact from the test cases as well as from
the above statement. I'm led to the conclusion that you are changing the
"URI producing" code, and that this will have the side effect of changing
the normalization code. Is this a correct assessment? If so, I'm wondering
whether this leads to the opposite problem.

RFC 3986 doesn't say that one should always percent-encode every reserved
character. As I think you pointed out, it says:

   URI producing applications should percent-encode data octets that
   correspond to characters in the reserved set *unless these characters
   are specifically allowed by the URI scheme to represent data in that
   component.*  If a reserved character is found in a URI component and
   no delimiting role is known for that character, then it must be
   interpreted as representing the data octet corresponding to that
   character's encoding in US-ASCII.


This "should ... unless ... specifically allowed by ..." means that they
don't need to be percent-encoded, and perhaps they shouldn't automatically
be encoded. "Should" is soft, in that producing the URI either way is
legal, but I think your original perspective that they "should not" be
encoded is correct as the default expectation.

I think it is useful to have URI producing classes that can specify whether
something is a "path component" or a "path". In the case of a "path
component", the class would automatically percent-encode the "/" reserved
character and the like. If my reading of the change is right and this
capability is being introduced, then I do like it. Similarly, I think there
should be a way to add components without normalization or
percent-encoding, although for the most part people use StringBuilder or
similar to achieve this end today.
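To make the "path component" versus "path" distinction concrete, here is a minimal sketch in plain Java. The class and method names (`PathEncodingSketch`, `encodeSegment`, `encodePath`, `encodeSegmentStrict`) are hypothetical and are not HttpCore API. It assumes UTF-8 for non-ASCII data and takes the set of characters allowed literally in a path segment from the RFC 3986 `pchar` rule.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of HttpCore or HttpClient.
final class PathEncodingSketch {

    // RFC 3986 "unreserved" characters other than ALPHA / DIGIT.
    private static final String UNRESERVED_MARKS = "-._~";
    // RFC 3986 sub-delims plus ':' and '@', which pchar permits literally.
    private static final String SEGMENT_EXTRAS = "!$&'()*+,;=:@";

    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || UNRESERVED_MARKS.indexOf(c) >= 0;
    }

    // Percent-encodes every UTF-8 octet that is neither unreserved nor in alsoAllowed.
    private static String encode(String data, String alsoAllowed) {
        StringBuilder out = new StringBuilder();
        for (byte b : data.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (isUnreserved(c) || alsoAllowed.indexOf(c) >= 0) {
                out.append((char) c);
            } else {
                out.append(String.format("%%%02X", c));
            }
        }
        return out.toString();
    }

    // A single path component: '/' is data here, so it gets encoded.
    static String encodeSegment(String data) {
        return encode(data, SEGMENT_EXTRAS);
    }

    // A whole path: '/' is a delimiter here, so it is left alone.
    static String encodePath(String path) {
        return encode(path, SEGMENT_EXTRAS + "/");
    }

    // The stricter choice: encode every reserved character, permitted or not.
    static String encodeSegmentStrict(String data) {
        return encode(data, "");
    }
}
```

With this sketch, `encodeSegment("a/b")` gives `a%2Fb` while `encodePath("a/b")` gives `a/b`. For `foo&bar`, `encodeSegment` leaves the `&` intact and `encodeSegmentStrict` produces `foo%26bar`. Both outputs are legal URIs; the open question is which one a producer should default to.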

The original concern for me wasn't about URI producing. The concern was
that URI normalization, which was being applied by default, was changing
the URI in a way that was not necessary, and that was not considered
"normalization" per RFC 3986 and the expectations of at least some people
and implementations:

   The purpose of reserved characters is to provide a set of delimiting
   characters that are distinguishable from other data within a URI.
   URIs that differ in the replacement of a reserved character with its
   corresponding percent-encoded octet are not equivalent.  Percent-
   encoding a reserved character, or decoding a percent-encoded octet
   that corresponds to a reserved character, will change how the URI is
   interpreted by most applications.  *Thus, characters in the reserved
   set are protected from normalization and are therefore safe to be
   used by scheme-specific and producer-specific algorithms for
   delimiting data subcomponents within a URI.*


Whether you or I might argue that they "should" or "should not" be
equivalent from a URI producing perspective, we have to expect that
reasonable people, and existing implementations that we need to
interoperate with, might disagree. RFC 3986 is now explicit that the
characters in the reserved set are protected from normalization and are
therefore safe to use by scheme-specific and producer-specific algorithms
for delimiting data subcomponents within a URI. This isn't saying it
"should" or "shouldn't" be done. It's saying that it is recognized that it
is being done, so any method of normalization needs to be cautious.

My expectation is that an HTTP client that receives a URI from a redirect
or other external source should not automatically apply unnecessary
normalization that might change what the URI means to the producer. That
means acknowledging that the producer is authoritative on whether or not
the reserved characters should be encoded for their use case, and that
Apache HTTP Client is not permitted to transform the URI into something
that means something different in this case.

For example, according to RFC 3986 I would expect:

    http://acme.com/foo&bar

to be normalized to:

    http://acme.com/foo&bar

And:

    http://acme.com/foo%26bar

to be normalized to:

    http://acme.com/foo%26bar

Does the proposed code result in this expectation being met? Or does it
always percent-encode, leading to the opposite problem, that normalization
is now potentially breaking applications that presume the '&' will be left
intact in the path segment?
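The expectation above can be sketched as a conservative normalizer that performs only the percent-encoding steps RFC 3986 describes as safe: it decodes percent-encoded octets that correspond to unreserved characters and uppercases the hex digits of the remaining escapes. It never encodes or decodes a reserved character. The class and method names are hypothetical; this is not the code in the pull request, and it deliberately ignores the other normalization steps (scheme and host case, dot-segment removal).

```java
// Hypothetical illustration only; not part of HttpCore or HttpClient.
final class ConservativeNormalizerSketch {

    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Returns the value of an ASCII hex digit, or -1 if c is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    static String normalizePercentEncoding(String uri) {
        StringBuilder out = new StringBuilder(uri.length());
        int i = 0;
        while (i < uri.length()) {
            char ch = uri.charAt(i);
            if (ch == '%' && i + 2 < uri.length()) {
                int hi = hexValue(uri.charAt(i + 1));
                int lo = hexValue(uri.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    int octet = hi * 16 + lo;
                    if (isUnreserved(octet)) {
                        // Safe to decode: %7E and ~ are equivalent.
                        out.append((char) octet);
                    } else {
                        // Keep the escape; only uppercase its hex digits.
                        out.append('%')
                           .append(Character.toUpperCase(uri.charAt(i + 1)))
                           .append(Character.toUpperCase(uri.charAt(i + 2)));
                    }
                    i += 3;
                    continue;
                }
            }
            // Literal characters, including reserved ones such as '&', pass through.
            out.append(ch);
            i++;
        }
        return out.toString();
    }
}
```

Under this sketch both `http://acme.com/foo&bar` and `http://acme.com/foo%26bar` come back unchanged, while `http://acme.com/%7Euser` becomes `http://acme.com/~user` and `a%2fb` becomes `a%2Fb`.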


> I still find RFC 3986 terribly inconsistent and confusing but I suppose
> I am just not smart enough for it.
>

I wonder if URI producing and URI normalization are being conflated, and
whether this is the crucial point for resolving the confusion from both
perspectives?


-- 
Mark Mielke <[email protected]>
