dennishuo commented on code in PR #10877:
URL: https://github.com/apache/iceberg/pull/10877#discussion_r2586041642
##########
core/src/test/java/org/apache/iceberg/rest/TestRESTUtil.java:
##########
@@ -67,18 +70,24 @@ public void testStripTrailingSlash() {
}
}
- @Test
- public void testRoundTripUrlEncodeDecodeNamespace() {
+ @ParameterizedTest
+ @ValueSource(strings = {"%1F", "%2D", "%2E"})
Review Comment:
I think there's a slight distinction that's getting blurred in the code. As
I understand it, UTF-8 and URL encoding are two separate concepts, and what we
seem to mean in the Iceberg code is that these are the URL-encoded strings
produced *after* applying UTF-8 encoding to obtain the underlying bytes.
My guess is that the [URLEncoder
javadoc](https://docs.oracle.com/javase/8/docs/api/java/net/URLEncoder.html#encode-java.lang.String-java.lang.String-)
makes it easy to mix this up since it says:
`Note: The [World Wide Web Consortium
Recommendation](http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars)
states that UTF-8 should be used. Not doing so may introduce
incompatibilities.`
This statement *seems* to imply that the URL-encoding convention is itself
fundamental to UTF-8, when it seems to intend only that the characters should
first be encoded to bytes as UTF-8, with URL encoding via "percent-escaping"
applied afterwards. This is clearer in the linked doc, which calls out the two
separate steps:
https://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars
> Although URIs do not contain non-ASCII values (see
> [[URI]](https://www.w3.org/TR/html40/references.html#ref-URI), section 2.1)
> authors sometimes specify them in attribute values expecting URIs (i.e.,
> defined with [%URI;](https://www.w3.org/TR/html40/sgml/dtd.html#URI) in the
> [DTD](https://www.w3.org/TR/html40/sgml/dtd.html)). For instance, the following
> [href](https://www.w3.org/TR/html40/struct/links.html#adef-href) value is
> illegal:
>
>     <A href="http://foo.org/Håkon">...</A>
>
> We recommend that user agents adopt the following convention for
> handling non-ASCII characters in such cases:
>
> 1. Represent each character in UTF-8 (see
> [[RFC2279]](https://www.w3.org/TR/html40/references.html#ref-RFC2279)) as one
> or more bytes.
> 2. Escape these bytes with the URI escaping mechanism (i.e., by
> converting each byte to %HH, where HH is the hexadecimal notation of the byte
> value).
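
To make the two steps concrete, here is a rough sketch in plain Java. It is not
the Iceberg `RESTUtil` code; the class and method names are made up for
illustration, and it escapes every byte except ASCII letters and digits, which
is stricter than what `URLEncoder` does. The point is only that the charset is
chosen in step 1 and the percent-escaping in step 2 never looks at it:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Iceberg or the JDK.
class PercentEscapeSketch {

  static String percentEscape(String input, Charset charset) {
    // Step 1: represent each character as one or more bytes in the chosen charset.
    byte[] bytes = input.getBytes(charset);

    // Step 2: escape the bytes as %HH, independent of how they were produced.
    StringBuilder escaped = new StringBuilder();
    for (byte b : bytes) {
      int value = b & 0xFF;
      boolean asciiAlphanumeric =
          (value >= 'A' && value <= 'Z')
              || (value >= 'a' && value <= 'z')
              || (value >= '0' && value <= '9');
      if (asciiAlphanumeric) {
        escaped.append((char) value);
      } else {
        escaped.append(String.format("%%%02X", value));
      }
    }
    return escaped.toString();
  }

  public static void main(String[] args) {
    String name = "H\u00E5kon"; // "Håkon", written with an escape to avoid source-encoding issues

    // Same escaping step, different byte encodings, different results.
    System.out.println(percentEscape(name, StandardCharsets.UTF_8)); // H%C3%A5kon
    System.out.println(percentEscape(name, StandardCharsets.ISO_8859_1)); // H%E5kon

    // For ASCII characters such as the unit separator, the UTF-8 bytes are the
    // code points themselves, so the escaped form is a single %HH.
    System.out.println(percentEscape("\u001F", StandardCharsets.UTF_8)); // %1F
  }
}
```

So a value like `%1F` is best described as the URL-encoded (percent-escaped)
form of the UTF-8 bytes, rather than as a "UTF-8" string in its own right.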
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]