On Sun, Jul 26, 2020 at 1:29 PM Oleg Kalnichevski <[email protected]> wrote:
> Please find a few minutes to review the changes I am proposing to make
> HttpCore partially RFC3986 conformant in the 5.1 branch:
>
> https://github.com/apache/httpcomponents-core/pull/205
>
> There are two major differences to the behavior of HttpCore 5.0.x and
> HttpClient 4.5.x:
>
> 1. percent-encoding is applied to all unreserved characters whenever or
> not some of those characters are explicitly permitted for use in URI
> components, for instance `&` character in path segments.
>

Did you mean "is applied to all *reserved* characters whether or not some of these characters are explicitly permitted for use in URI components"?

I am trying to understand the impact. From the test cases, as well as the above statement, I'm led to the conclusion that you are changing the "URI producing" code, which will have the side effect of changing the normalization code. Is this a correct assessment?

If so, I'm wondering whether this leads to the opposite problem. RFC 3986 doesn't say that one should always percent-encode every reserved character. As I think you pointed out, it says:

    URI producing applications should percent-encode data octets that
    correspond to characters in the reserved set *unless these characters
    are specifically allowed by the URI scheme to represent data in that
    component.* If a reserved character is found in a URI component and
    no delimiting role is known for that character, then it must be
    interpreted as representing the data octet corresponding to that
    character's encoding in US-ASCII.

This "should ... unless ... specifically allowed by ..." means that they don't need to be percent-encoded, and perhaps they shouldn't automatically be encoded. "Should" is soft, in that URI producing either way should be legal, but I think your original perspective that they "should not" is correct as the default expectation.
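To make the "unless specifically allowed in that component" reading concrete, here is a minimal sketch of a path-segment producer. It assumes the input is raw data (so a literal `%` is itself encoded) and it uses the `pchar` rule from RFC 3986 as the allowed set. `PathSegmentEncoder` is a hypothetical name of mine, not an HttpCore class, and I am not claiming this is what the pull request does.

```java
// Hypothetical helper, not part of HttpCore: percent-encodes ONE path segment,
// leaving alone every character RFC 3986 permits in a segment (pchar).
final class PathSegmentEncoder {

    private static final String UNRESERVED_PUNCT = "-._~";
    private static final String SUB_DELIMS = "!$&'()*+,;=";
    private static final String HEX = "0123456789ABCDEF";

    // pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
    // The input is treated as raw data, so '%' is NOT in the allowed set here.
    private static boolean allowedInSegment(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
                || (b >= '0' && b <= '9')
                || UNRESERVED_PUNCT.indexOf(b) >= 0
                || SUB_DELIMS.indexOf(b) >= 0
                || b == ':' || b == '@';
    }

    static String encode(String segment) {
        StringBuilder out = new StringBuilder();
        // Encode the UTF-8 octets of the segment one at a time.
        for (byte raw : segment.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            if (allowedInSegment(b)) {
                out.append((char) b);
            } else {
                out.append('%').append(HEX.charAt(b >> 4)).append(HEX.charAt(b & 0x0F));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("foo&bar")); // foo&bar  ('&' is a sub-delim, allowed)
        System.out.println(encode("a/b"));     // a%2Fb    ('/' would delimit segments)
    }
}
```

With this rule `&` survives in a segment while `/` does not, which is the default I would expect from a "path component" producer.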
I think it is useful to have URI producing classes that can specify whether something is a "path component" or a "path", and in the case of a "path component", it would automatically percent-encode the "/" reserved character and such. If my read of the change is right and this capability is being introduced, then I do like it. Similarly, I think there should be a way to add components without normalization or percent-encoding, although for the most part people use StringBuilder or similar to achieve this end today.

The original concern for me wasn't about URI producing. The concern was that URI normalization, which was being applied by default, was changing the URI in a way that was not necessary, and that was not considered "normalization" per RFC 3986 and the expectations of at least some people and implementations:

    The purpose of reserved characters is to provide a set of delimiting
    characters that are distinguishable from other data within a URI.
    URIs that differ in the replacement of a reserved character with its
    corresponding percent-encoded octet are not equivalent. Percent-encoding
    a reserved character, or decoding a percent-encoded octet that
    corresponds to a reserved character, will change how the URI is
    interpreted by most applications. *Thus, characters in the reserved
    set are protected from normalization and are therefore safe to be
    used by scheme-specific and producer-specific algorithms for
    delimiting data subcomponents within a URI.*

Whatever you or I might argue about whether they "should" or "should not" be equivalent from a URI producing perspective, we have to expect that reasonable people, and existing implementations we need to interoperate with, might disagree. RFC 3986 is now explicit that the characters in the reserved set are protected from normalization and therefore safe to use by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.
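As a rough illustration of normalization that respects this protection, the sketch below applies only the percent-encoding steps I read in RFC 3986 section 6.2.2: upper-case the hex digits of each escape, and decode an escape only when it represents an unreserved character. Reserved characters, encoded or literal, pass through untouched. `PercentEncodingNormalizer` is a hypothetical name, and the sketch deliberately skips scheme/host case folding and dot-segment removal.

```java
// Hypothetical sketch, not HttpCore code: percent-encoding normalization only.
final class PercentEncodingNormalizer {

    // unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static boolean isUnreserved(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Returns 0..15 for an ASCII hex digit, -1 otherwise.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    static String normalize(String uri) {
        StringBuilder out = new StringBuilder(uri.length());
        int i = 0;
        while (i < uri.length()) {
            char c = uri.charAt(i);
            boolean hasTwoMore = i + 2 < uri.length();
            int hi = hasTwoMore ? hexValue(uri.charAt(i + 1)) : -1;
            int lo = hasTwoMore ? hexValue(uri.charAt(i + 2)) : -1;
            if (c == '%' && hi >= 0 && lo >= 0) {
                int octet = (hi << 4) | lo;
                if (isUnreserved(octet)) {
                    // Safe: an encoded unreserved character is equivalent to the literal.
                    out.append((char) octet);
                } else {
                    // Reserved or other octet: keep it encoded, only upper-case the hex.
                    out.append('%')
                       .append(Character.toUpperCase(uri.charAt(i + 1)))
                       .append(Character.toUpperCase(uri.charAt(i + 2)));
                }
                i += 3;
            } else {
                // Literal characters, including reserved ones, are never encoded here.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://acme.com/foo&bar"));   // http://acme.com/foo&bar
        System.out.println(normalize("http://acme.com/foo%26bar")); // http://acme.com/foo%26bar
        System.out.println(normalize("http://acme.com/%7euser"));   // http://acme.com/~user
    }
}
```

The point of the sketch is that the two `&` forms stay distinct, while `%7e` and `~` collapse to the same URI.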
This isn't saying it "should" or "shouldn't" be done. It's saying that it is recognized that it is being done, so any method of normalization needs to be cautious.

My expectation is that an HTTP client which receives a URI from a redirect or other external source should not automatically apply unnecessary normalization that might change what the URI means to its producer. That means acknowledging that the producer is authoritative on whether the reserved characters should be encoded for their use case, and that Apache HTTP Client is not permitted to transform the URI into one that means something different.

For example, according to RFC 3986 I would expect:

    http://acme.com/foo&bar

to be normalized to:

    http://acme.com/foo&bar

And:

    http://acme.com/foo%26bar

to be normalized to:

    http://acme.com/foo%26bar

Does the proposed code result in this expectation being met? Or does it always percent-encode, leading to the opposite problem, where normalization potentially breaks applications that presume the '&' will be left intact in the path segment?

> I still find RFC 3986 terribly inconsistent and confusing but I suppose
> I am just not smart enough for it.

I wonder if URI producing and URI normalization are being conflated, and whether this is the crucial point in resolving the confusion from both perspectives?

--
Mark Mielke <[email protected]>
