Re: RFC3986 comformance saga

Michael Osipov Wed, 29 Jul 2020 06:53:25 -0700

On 2020-07-26 at 23:17, Mark Mielke wrote:

On Sun, Jul 26, 2020 at 1:29 PM Oleg Kalnichevski <[email protected]> wrote:

Please find a few minutes to review the changes I am proposing to make
HttpCore partially RFC3986 conformant in the 5.1 branch:

https://github.com/apache/httpcomponents-core/pull/205

There are two major differences to the behavior of HttpCore 5.0.x and
HttpClient 4.5.x:

1. percent-encoding is applied to all unreserved characters whether or
not some of those characters are explicitly permitted for use in URI
components, for instance the `&` character in path segments.


Did you mean "is applied to all *reserved* characters whether or not some
of these characters are explicitly permitted for use in URI components"?

I think this is a typo. Look at the code: all unreserved chars are passed as-is.

I am trying to understand the impact. From the test cases, as well as the
above statement, I'm led to the conclusion that you are changing the "URI
producing" code, which will have the side effect of changing the normalization
code. Is this a correct assessment? If so, I'm wondering whether this leads
to the opposite problem?

RFC 3986 doesn't say that one should always percent-encode every reserved
character. As I think you pointed out, it says:

    URI producing applications should percent-encode data octets that
    correspond to characters in the reserved set *unless these characters
    are specifically allowed by the URI scheme to represent data in that
    component.*  If a reserved character is found in a URI component and
    no delimiting role is known for that character, then it must be
    interpreted as representing the data octet corresponding to that
    character's encoding in US-ASCII.


This "should ... unless ... specifically allowed by ..." means that they
don't need to be percent-encoded, and perhaps they shouldn't automatically
be encoded. "Should" is soft, in that URI producing either way should be
legal, but I think your original perspective that they "should not" is
correct as the default expectation.

I think it is useful to have URI producing classes that can specify whether
something is a "path component" or a "path", and in the case of a "path
component", it would automatically percent-encode the "/" reserved
character and such. If my reading is right and this change introduces that
capability, then I do like it. Similarly, I think there should be a
way to add components without normalization or percent-encoding, although
for the most part people use StringBuilder or similar to achieve this end
today.
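
To make the "path" versus "path component" distinction concrete, here is a minimal Python sketch. It is not HttpCore code: the names `encode_path_segment`, `encode_path` and the `strict` flag are hypothetical, and only `urllib.parse.quote` is a real library call. In strict mode every reserved character in a segment is percent-encoded. In lenient mode the sub-delims plus `:` and `@`, which the RFC 3986 `pchar` rule permits in a path segment, are left intact.

```python
from urllib.parse import quote

# Characters RFC 3986 allows unescaped in a path segment besides the
# unreserved set: the sub-delims plus ":" and "@" (the "pchar" rule).
PCHAR_LITERALS = "!$&'()*+,;=:@"


def encode_path_segment(segment, strict=True):
    """Percent-encode one path segment (hypothetical helper).

    "/" is always encoded, because inside a single segment it is data,
    not a delimiter. Unreserved characters are never encoded.
    """
    safe = "" if strict else PCHAR_LITERALS
    return quote(segment, safe=safe)


def encode_path(segments, strict=True):
    """Join already-split segments into an absolute path."""
    return "/" + "/".join(encode_path_segment(s, strict) for s in segments)


print(encode_path_segment("a/b&c"))                # a%2Fb%26c
print(encode_path_segment("a/b&c", strict=False))  # a%2Fb&c
print(encode_path(["foo", "a/b&c"]))               # /foo/a%2Fb%26c
```

Both modes produce valid URIs. They differ only in whether the producer treats characters such as `&` as data that must be escaped or as literals the component allows.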

RFC 7230 does not deviate from RFC 3986: https://github.com/apache/httpcomponents-core/pull/205#issuecomment-665557772

There is no special handling for HTTP, as far as I understand.

The original concern for me wasn't about URI producing. The concern was
that URI normalization, which was being applied by default, was changing
the URI in a way that was not necessary, and that was not considered
"normalization" per RFC 3986 and the expectations of at least some people
and implementations:

    The purpose of reserved characters is to provide a set of delimiting
    characters that are distinguishable from other data within a URI.
    URIs that differ in the replacement of a reserved character with its
    corresponding percent-encoded octet are not equivalent.  Percent-
    encoding a reserved character, or decoding a percent-encoded octet
    that corresponds to a reserved character, will change how the URI is
    interpreted by most applications.  *Thus, characters in the reserved
    set are protected from normalization and are therefore safe to be
    used by scheme-specific and producer-specific algorithms for
    delimiting data subcomponents within a URI.*


Whether you or I might argue that they "should" or "should not" be
equivalent from a URI producing perspective, reasonable people and existing
implementations that we need to interoperate with might disagree, and
RFC 3986 now states explicitly that the characters in the reserved set are
protected from normalization and are therefore safe to use by scheme-specific
and producer-specific algorithms for delimiting data subcomponents within a
URI. This isn't saying it "should" or "shouldn't" be done. It's saying that
it is recognized that it is being done, so any method of normalization
needs to be cautious.

My expectation is that an HTTP client that receives a URI from a redirect
or other external source should not automatically apply unnecessary
normalization that might change what the URI means to the producer. That
means acknowledging that the producer is authoritative for whether or not
the reserved characters should be encoded for its use case, and that Apache
HTTP Client is not permitted to transform the URI to mean something
different in this case.

I don't see any normalization code. It just encodes everything that is not safe.

For example, according to RFC 3986 I would expect:

     http://acme.com/foo&bar      to be normalized to:  http://acme.com/foo&bar

And:

     http://acme.com/foo%26bar    to be normalized to:  http://acme.com/foo%26bar

Does the proposed code result in this expectation being met? Or does it
always percent-encode, leading to the opposite problem, that normalization
is now potentially breaking applications that presume the '&' will be left
intact in the path segment?
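
As a way of pinning down that expectation, the following is a minimal Python sketch of percent-encoding normalization in the spirit of RFC 3986 section 6.2.2. It is not the code from the pull request, and the function name `normalize_percent_encoding` is hypothetical. It uppercases the hex digits of every percent-encoded octet and decodes only those octets that correspond to unreserved characters. Reserved characters, literal or percent-encoded, are left exactly as the producer wrote them.

```python
import re
import string

# RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(uri):
    """Normalize percent-encoded octets without touching reserved characters."""

    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            # An encoded unreserved character is equivalent to the literal one.
            return char
        # Anything else stays encoded; only the hex case is normalized.
        return "%" + match.group(1).upper()

    return _PCT_OCTET.sub(fix, uri)


print(normalize_percent_encoding("http://acme.com/foo&bar"))        # unchanged
print(normalize_percent_encoding("http://acme.com/foo%26bar"))      # unchanged
print(normalize_percent_encoding("http://acme.com/%6d%61%6d%61/"))  # http://acme.com/mama/
print(normalize_percent_encoding("http://acme.com/a%2fb"))          # http://acme.com/a%2Fb
```

Under this sketch the two example URIs are each their own normal form and never collapse into one another.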


Why don't you try?

I still find RFC 3986 terribly inconsistent and confusing, but I suppose
I am just not smart enough for it.


I wonder if URI producing and URI normalization are being conflated, and
this is the crucial point to resolving the confusion from both perspectives?

I think the code complies with https://tools.ietf.org/html/rfc7230#section-2.7.3


It will not encode "/mama/" to "/%6d%61%6d%61/".

M


