stevenzwu commented on code in PR #16174:
URL: https://github.com/apache/iceberg/pull/16174#discussion_r3290473847


##########
core/src/main/java/org/apache/iceberg/util/LocationUtil.java:
##########
@@ -57,4 +57,61 @@ public static String tableLocation(TableIdentifier tableIdentifier, boolean useU
       return tableIdentifier.name();
     }
   }
+
+  /**
+   * Returns true if the location contains a URI scheme (e.g. {@code s3:}, {@code hdfs:}, {@code
+   * file:}), per <a href="https://datatracker.ietf.org/doc/html/rfc3986#section-3.1">RFC 3986
+   * section 3.1</a>.
+   */
+  private static boolean hasScheme(String location) {
+    for (int i = 0; i < location.length(); i += 1) {
+      char ch = location.charAt(i);
+      if (ch == ':') {
+        return i > 0;
+      }
+
+      if (!Character.isLetterOrDigit(ch) && ch != '+' && ch != '-' && ch != '.') {

Review Comment:
   `Character.isLetterOrDigit(char)` accepts any BMP character in a Unicode letter or digit category (CJK ideographs, Cyrillic, Arabic-Indic digits, etc.), which is broader than the scheme grammar allows. RFC 3986 defines the URI grammar over US-ASCII per [§2](https://datatracker.ietf.org/doc/html/rfc3986#section-2):
   
   > The ABNF notation defines its terminal values to be non-negative integers 
(codepoints) based on the US-ASCII coded character set [ASCII]. Because a URI 
is a sequence of characters, we must invert that relation in order to 
understand the URI syntax. Therefore, the integer values used by the ABNF must 
be mapped back to their corresponding characters via US-ASCII in order to 
complete the syntax rules.
   
   And `ALPHA` / `DIGIT` in the scheme production at [§3.1](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) come from [RFC 5234 Appendix B.1](https://datatracker.ietf.org/doc/html/rfc5234#appendix-B.1), which restricts them to `%x41-5A`, `%x61-7A`, and `%x30-39`.
   
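   In practice no Iceberg location has non-ASCII characters before the `:`, so the concern is theoretical. Either note in the Javadoc that the check is deliberately more liberal than the RFC, or tighten it to inline ASCII ranges. The expression below is the accept condition (to be negated in the `if`), and it also requires the first character to be a letter: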
   
   ```java
   (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
       || (i > 0 && ((ch >= '0' && ch <= '9') || ch == '+' || ch == '-' || ch == '.'))
   ```
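
For reference, here is a minimal self-contained sketch of how the tightened check might look once the expression above is folded into the loop. The class name `SchemeCheck` and the helper `isSchemeChar` are hypothetical names used only for illustration. The final `return false` for a string with no `:` is an assumption, since the rest of the method is not shown in this hunk.

```java
// Illustrative sketch only: SchemeCheck and isSchemeChar are hypothetical names,
// not part of LocationUtil.
public final class SchemeCheck {
  private SchemeCheck() {}

  // Accept condition for the character at the given index, restricted to US-ASCII:
  // scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )  (RFC 3986 section 3.1)
  static boolean isSchemeChar(char ch, int index) {
    boolean asciiLetter = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    if (asciiLetter) {
      return true;
    }

    // Digits and '+', '-', '.' are only allowed after the first character.
    return index > 0 && ((ch >= '0' && ch <= '9') || ch == '+' || ch == '-' || ch == '.');
  }

  static boolean hasScheme(String location) {
    for (int i = 0; i < location.length(); i += 1) {
      char ch = location.charAt(i);
      if (ch == ':') {
        // A leading ':' means the scheme would be empty.
        return i > 0;
      }

      if (!isSchemeChar(ch, i)) {
        return false;
      }
    }

    // Assumption: a location without ':' has no scheme.
    return false;
  }
}
```

With this version, a prefix such as `файл:` is rejected, whereas `Character.isLetterOrDigit` would accept each of its letters. A prefix starting with a digit, such as `3s:`, is also rejected, which is stricter than the current loop.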
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

