rdblue commented on code in PR #16174:
URL: https://github.com/apache/iceberg/pull/16174#discussion_r3198019046


##########
core/src/main/java/org/apache/iceberg/util/LocationUtil.java:
##########
@@ -57,4 +57,69 @@ public static String tableLocation(TableIdentifier 
tableIdentifier, boolean useU
       return tableIdentifier.name();
     }
   }
+
+  private static final int MAX_SCHEME_LENGTH = 10;
+
+  /**
+   * Returns true if the location contains a URI scheme (e.g. {@code s3:}, 
{@code hdfs:}, {@code
+   * file:}). Checks at most the first 10 characters for a {@code :} preceded 
by alphanumeric
+   * characters, per <a 
href="https://datatracker.ietf.org/doc/html/rfc3986#section-3.1";>RFC 3986
+   * section 3.1</a>.
+   */
+  private static boolean hasScheme(String location) {

Review Comment:
   I think it's reasonable to try to speed this up, but I'm concerned about 
being outside of the RFC. Section 3.1 says that the scheme can contain `+`, 
`-`, and `.`:
   
   ```
   scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
   ```
   
   Rather than only using `isLetterOrDigit`, I think this should also cover `ch 
== '+' || ch == '-' || ch == '.'`.
   
   There is also no length check. I'm torn on this because I don't think we are 
effectively limiting many cases by choosing a limit. At the same time, how 
often are we going to hit paths of only alphanumeric characters without `/`?
   
   The false-positive case is that we have a long relative path directory name 
that is not a scheme, since we think schemes are almost always < 10 chars. 
Initial directory names are almost always "data/" or "metadata/", which will 
exit the loop on `/` at index 4 or 8. By that logic, this only helps if we're 
checking very strange paths that are not absolute, not regular table paths, and 
have long alphanumeric directory names. Doesn't seem worth the potential bug 
for this optimization to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to