dennishuo commented on code in PR #10877:
URL: https://github.com/apache/iceberg/pull/10877#discussion_r2586041642


##########
core/src/test/java/org/apache/iceberg/rest/TestRESTUtil.java:
##########
@@ -67,18 +70,24 @@ public void testStripTrailingSlash() {
     }
   }
 
-  @Test
-  public void testRoundTripUrlEncodeDecodeNamespace() {
+  @ParameterizedTest
+  @ValueSource(strings = {"%1F", "%2D", "%2E"})

Review Comment:
   I think there's a slight distinction that's getting blurred in the code. As 
I understand it, UTF-8 and URL encoding are two separate concepts, and 
what we seem to mean in the Iceberg code is that these are the URL-encoded 
strings produced *after* applying UTF-8 encoding to get the underlying bytes.
   
   My guess is that the [URLEncoder 
javadoc](https://docs.oracle.com/javase/8/docs/api/java/net/URLEncoder.html#encode-java.lang.String-java.lang.String-)
 makes it easy to mix this up since it says:
   
   `Note: The [World Wide Web Consortium 
Recommendation](http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars)
 states that UTF-8 should be used. Not doing so may introduce 
incompatibilities.`
   
   This statement *seems* to imply that the URL-encoding convention is itself 
fundamental to UTF-8, when I think it only means that the string should first 
be encoded to bytes as UTF-8, with URL encoding via "percent-escaping" then 
applied to those bytes. This is clearer in the linked doc, which calls out 
the two separate steps:
   
   https://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars
   
       Although URIs do not contain non-ASCII values (see 
[[URI]](https://www.w3.org/TR/html40/references.html#ref-URI), section 2.1) 
authors sometimes specify them in attribute values expecting URIs (i.e., 
defined with [%URI;](https://www.w3.org/TR/html40/sgml/dtd.html#URI) in the 
[DTD](https://www.w3.org/TR/html40/sgml/dtd.html)). For instance, the following 
[href](https://www.w3.org/TR/html40/struct/links.html#adef-href) value is 
illegal:
   
       
       <A href="http://foo.org/Håkon">...</A>
   
       We recommend that user agents adopt the following convention for 
handling non-ASCII characters in such cases:
   
       
       1. Represent each character in UTF-8 (see 
[[RFC2279]](https://www.w3.org/TR/html40/references.html#ref-RFC2279)) as one 
or more bytes.
       2. Escape these bytes with the URI escaping mechanism (i.e., by 
converting each byte to %HH, where HH is the hexadecimal notation of the byte 
value).
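
   To make the two steps concrete, here is a minimal sketch (not code from this PR). The class `PercentEncodeSketch` and the helper `percentEscapeUtf8` are hypothetical names used only for illustration. The helper performs the two steps by hand, and `main` compares the result with `java.net.URLEncoder` under two different charsets:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Iceberg.
public class PercentEncodeSketch {

  // Step 1: represent the characters as UTF-8 bytes.
  // Step 2: escape each byte as %HH.
  // Note: this escapes every byte unconditionally, whereas URLEncoder
  // leaves some ASCII characters (letters, digits, '-', '.', ...) as-is.
  static String percentEscapeUtf8(String s) {
    StringBuilder sb = new StringBuilder();
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      sb.append(String.format("%%%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) throws UnsupportedEncodingException {
    String aRing = "\u00E5"; // the non-ASCII letter from the W3C example
    // Manual two-step encoding; URLEncoder with UTF-8 yields the same bytes.
    System.out.println(percentEscapeUtf8(aRing));
    System.out.println(URLEncoder.encode(aRing, "UTF-8"));
    // Same percent-escaping, different byte encoding, different result.
    System.out.println(URLEncoder.encode(aRing, "ISO-8859-1"));
    // The unit separator control character is a single byte in UTF-8.
    System.out.println(URLEncoder.encode("\u001F", "UTF-8"));
  }
}
```

   The percent-escaping step is identical in every call; only the charset chosen for the byte step changes the output, which is why I'd describe values like `%1F` as URL-encoded (over UTF-8 bytes) rather than as "UTF-8".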



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

