On 2020-07-26 at 23:17, Mark Mielke wrote:
On Sun, Jul 26, 2020 at 1:29 PM Oleg Kalnichevski <[email protected]> wrote:
Please find a few minutes to review the changes I am proposing to make
HttpCore partially RFC3986 conformant in the 5.1 branch:
https://github.com/apache/httpcomponents-core/pull/205
There are two major differences to the behavior of HttpCore 5.0.x and
HttpClient 4.5.x:
1. percent-encoding is applied to all unreserved characters whether or
not some of those characters are explicitly permitted for use in URI
components, for instance the `&` character in path segments.
Did you mean "is applied to all *reserved* characters whether or not some
of these characters are explicitly permitted for use in URI components"?
I think this is a typo. Look at the code, all unreserved chars are
passed as-is.
I am trying to understand the impact. From the test cases, as well as the
above statement, I'm led to the conclusion that you are changing the "URI
producing" code, which will have the side effect of changing the
normalization code. Is this a correct assessment? If so, I'm wondering
whether this leads to the opposite problem?
RFC 3986 doesn't say that one should always percent-encode every reserved
character. As I think you pointed out, it says:
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set *unless these characters
are specifically allowed by the URI scheme to represent data in that
component.* If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
This "should ... unless ... specifically allowed by ..." means that they
don't need to be percent-encoded, and perhaps they shouldn't automatically
be encoded. "Should" is soft, in that URI producing either way should be
legal, but I think your original perspective that they "should not" is
correct as the default expectation.
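
To make the two readings of that "should ... unless" clause concrete, here
is a minimal Java sketch of a path-segment encoder with both policies. The
class name PathSegmentEncoderSketch and the `strict` flag are hypothetical
and are not HttpCore API. With strict = true everything outside the RFC 3986
unreserved set is percent-encoded; with strict = false the characters that
the RFC 3986 `pchar` rule allows in a path segment (sub-delims, ":" and "@")
are passed through as-is.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of HttpCore.
final class PathSegmentEncoderSketch {

    // RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static final String UNRESERVED_MARKS = "-._~";
    // RFC 3986 section 3.3: pchar additionally allows sub-delims, ":" and "@"
    private static final String SEGMENT_EXTRAS = "!$&'()*+,;=:@";

    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || UNRESERVED_MARKS.indexOf(c) >= 0;
    }

    // Encodes the data of ONE path segment, so "/" is always escaped.
    // strict = true : keep only unreserved characters as-is.
    // strict = false: also keep the characters pchar permits in a segment.
    static String encode(String segment, boolean strict) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean keep = isUnreserved(c)
                    || (!strict && SEGMENT_EXTRAS.indexOf(c) >= 0);
            if (keep) {
                out.append((char) c);
            } else {
                out.append('%').append(String.format("%02X", c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("foo&bar", false));
        System.out.println(encode("foo&bar", true));
        System.out.println(encode("a/b", false));
    }
}
```

Both outputs are legal URIs under RFC 3986; the question in this thread is
only which of the two should be the default.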
I think it is useful to have URI producing classes that can specify whether
something is a "path component" or a "path", and in the case of a "path
component", it would automatically percent-encode the "/" reserved
character and such. If my reading of the change is right and this capability
is being introduced, then I do like it. Similarly, I think there should be a
way to add components without normalization or percent-encoding, although
for the most part people use StringBuilder or similar to achieve this end
today.
RFC 7230 does not deviate from RFC 3986:
https://github.com/apache/httpcomponents-core/pull/205#issuecomment-665557772
There is no special handling for HTTP, as far as I understand.
The original concern for me wasn't about URI producing. The concern was
that URI normalization, which was being applied by default, was changing
the URI in a way that was not necessary, and that was not considered
"normalization" per RFC 3986 and the expectations of at least some people
and implementations:
The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI.
URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent. Percent-
encoding a reserved character, or decoding a percent-encoded octet
that corresponds to a reserved character, will change how the URI is
interpreted by most applications. *Thus, characters in the reserved
set are protected from normalization and are therefore safe to be
used by scheme-specific and producer-specific algorithms for
delimiting data subcomponents within a URI.*
Whether you or I might argue that they "should" or "should not" be
equivalent from a URI producing perspective, there is some expectation that
reasonable people and existing implementations which we need to
inter-operate with might disagree, and the formal documentation is now
explicit in RFC 3986 that the characters in the reserved set are protected
from normalization and therefore safe to use by scheme-specific and
producer-specific algorithms for delimiting data subcomponents within a
URI. This isn't saying it "should" or "shouldn't" be done. It's saying that
it is recognized that it is being done, so any method of normalization
needs to be cautious.
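
As a small illustration of that non-equivalence, the JDK's own java.net.URI
can be used to compare the two spellings. This sketch uses only
java.net.URI (not HttpCore); the class and method names are hypothetical.
The raw path keeps the distinction between `&` and `%26`, while the decoded
path collapses it, which is exactly the information a consumer loses if
reserved characters are decoded or encoded on its behalf.

```java
import java.net.URI;

// Hypothetical demo class; relies only on java.net.URI from the JDK.
final class ReservedEquivalenceSketch {

    // Path exactly as written in the URI, escapes untouched.
    static String rawPath(String uri) {
        return URI.create(uri).getRawPath();
    }

    // Path with percent-encoded octets decoded.
    static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    // java.net.URI compares the raw forms of the components.
    static boolean sameUri(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        String literal = "http://acme.com/foo&bar";
        String escaped = "http://acme.com/foo%26bar";
        System.out.println(rawPath(literal) + " vs " + rawPath(escaped));
        System.out.println(decodedPath(literal) + " vs " + decodedPath(escaped));
        System.out.println("equal: " + sameUri(literal, escaped));
    }
}
```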
My expectation is that an HTTP client that receives a URI from a redirect
or another external source should not automatically apply unnecessary
normalization that might change what the URI means to the producer. That
means acknowledging that the producer is authoritative on whether the
reserved characters should be encoded for its use case, and that Apache
HTTP Client is not permitted to transform the URI into something that
means something different.
I don't see any normalization code. It just encodes everything which is
not safe.
For example, according to RFC 3986 I would expect:
http://acme.com/foo&bar to be normalized to:
http://acme.com/foo&bar
And:
http://acme.com/foo%26bar to be normalized to:
http://acme.com/foo%26bar
Does the proposed code result in this expectation being met? Or does it
always percent-encode, leading to the opposite problem, that normalization
is now potentially breaking applications that presume the '&' will be left
intact in the path segment?
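
The expectation above can be written down as a small, testable sketch. This
is not the code from the pull request; the class and method names are
hypothetical. It implements only the percent-encoding part of RFC 3986
section 6.2.2: escapes of unreserved characters are decoded, hex digits of
the remaining escapes are upper-cased, and reserved characters are never
encoded or decoded.

```java
// Hypothetical normalizer for illustration; not the HttpCore implementation.
final class ReservedSafeNormalizerSketch {

    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Returns 0..15 for an ASCII hex digit, -1 otherwise.
    private static int hexValue(char ch) {
        if (ch >= '0' && ch <= '9') return ch - '0';
        if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
        if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
        return -1;
    }

    static String normalizePercentEncoding(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch == '%' && i + 2 < s.length()) {
                int hi = hexValue(s.charAt(i + 1));
                int lo = hexValue(s.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    int octet = (hi << 4) | lo;
                    if (isUnreserved(octet)) {
                        // Safe to decode: unreserved characters carry no
                        // delimiting role.
                        out.append((char) octet);
                    } else {
                        // Keep the escape, only upper-case its hex digits.
                        out.append('%')
                           .append(Character.toUpperCase(s.charAt(i + 1)))
                           .append(Character.toUpperCase(s.charAt(i + 2)));
                    }
                    i += 3;
                    continue;
                }
            }
            // Literal characters, reserved ones included, pass through.
            out.append(ch);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizePercentEncoding("http://acme.com/foo&bar"));
        System.out.println(normalizePercentEncoding("http://acme.com/foo%26bar"));
        System.out.println(normalizePercentEncoding("/%6d%61%6d%61/"));
    }
}
```

Under these assumptions both example URIs come back unchanged, and only an
escape such as %6d, which stands for an unreserved character, is rewritten.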
Why don't you try?
I still find RFC 3986 terribly inconsistent and confusing but I suppose
I am just not smart enough for it.
I wonder if URI producing and URI normalization are being conflated, and
this is the crucial point to resolving the confusion from both perspectives?
I think the code complies with
https://tools.ietf.org/html/rfc7230#section-2.7.3
It will not encode "/mama/" to "/%6d%61%6d%61/".
M