Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/10208#discussion_r47146241
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -331,6 +331,30 @@ private[spark] object Utils extends Logging {
}
/**
+ * A file name may contain some invalid url characters, such as " ". This method will convert the
+ * file name to a raw path accepted by `java.net.URI(String)`.
+ *
+ * Note: the file name must not contain "/" or "\"
+ */
+ def encodeFileNameToURIRawPath(fileName: String): String = {
+ require(!fileName.contains("/") && !fileName.contains("\\"))
+ // `file` and `localhost` are not used. Just to prevent URI from parsing `fileName` as
+ // scheme or host. The prefix "/" is required because URI doesn't accept a relative path.
+ // We should remove it after we get the raw path.
+ new URI("file", null, "localhost", -1, "/" + fileName, null, null).getRawPath.substring(1)
+ }
+
+ /**
+ * Get the file name from uri's raw path and decode it. The raw path of uri must not end with "/".
+ */
+ def decodeFileNameInURI(uri: URI): String = {
+ val rawPath = uri.getRawPath
+ assert(!rawPath.endsWith("/"))
+ val rawFileName = rawPath.split("/").last
+ new URI("file:///" + rawFileName).getPath.substring(1)
--- End diff --
I see. As far as I can tell, `URI` handles all of that just fine. The one
thing it cannot do is tell whether the final element of a path is a single
string containing a slash or two path elements separated by a slash, because
the class provides no way to access individual path elements.
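
To illustrate the ambiguity, here is a minimal sketch. The object name and the
two sample URIs are made up for illustration and are not part of the PR. It
only relies on `java.net.URI`'s `getPath` (decoded) and `getRawPath`
(still escaped):

```scala
import java.net.URI

object EncodedSlashDemo {
  // Hypothetical URIs, chosen only to show the ambiguity.
  val encoded = new URI("file:///dir/a%2Fb") // one element named "a/b" under "dir"
  val nested = new URI("file:///dir/a/b")    // element "b" under "dir/a"

  // After decoding, the two paths can no longer be told apart.
  val samePath: Boolean = encoded.getPath == nested.getPath

  // Only the raw path keeps the escaped slash inside a single element.
  val rawElements: Array[String] = encoded.getRawPath.split("/")
  val lastRawElement: String = rawElements.last
}
```

This is why the code in the diff has to split the raw path first and decode
the last element afterwards.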
I don't think you need to go to a URI, back to a string, and back to a URI
here. This may not even need its own method if the one-liner
`new URI("file:///" + rawFileName).getPath.substring(1)` is used in only one
place.
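
As a rough sketch of how the two directions fit together, assuming the same
`java.net.URI` calls as the diff: `FileNameRoundTrip`, `encode`, and `decode`
are hypothetical names used only for this example, not the names in the PR.

```scala
import java.net.URI

object FileNameRoundTrip {
  // Encoding direction, mirroring the approach in the diff: the multi-argument
  // constructor quotes characters that are illegal in a path, such as " ".
  def encode(fileName: String): String = {
    require(!fileName.contains("/") && !fileName.contains("\\"))
    new URI("file", null, "localhost", -1, "/" + fileName, null, null)
      .getRawPath.substring(1)
  }

  // Decoding direction: the one-liner from the comment, taking an
  // already-extracted raw file name. getPath decodes escaped octets.
  def decode(rawFileName: String): String =
    new URI("file:///" + rawFileName).getPath.substring(1)
}
```

With these definitions, `FileNameRoundTrip.decode(FileNameRoundTrip.encode("my file.txt"))`
should give back `"my file.txt"`.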
As an aside, some app servers are picky about serving URIs that contain %2F,
since escaped slashes have been used in security exploits: they disguise a
request for a local file path that would otherwise be caught by (faulty)
application logic that inspects raw paths without accounting for escape
sequences. I think Tomcat won't serve them, for example. It may be overkill,
but it might be worth conservatively rejecting such a URI anyway.
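
A conservative check along those lines could look like the sketch below.
`EncodedSlashGuard`, `hasEncodedSlash`, and `requireNoEncodedSlash` are
hypothetical helpers, not existing Spark utilities, and the check only covers
an escaped "/" in the path:

```scala
import java.net.URI
import java.util.Locale

object EncodedSlashGuard {
  // Hypothetical helper: true when the raw (still escaped) path contains
  // an escaped "/", written as %2F or %2f.
  def hasEncodedSlash(uri: URI): Boolean = {
    val rawPath = uri.getRawPath // null for opaque URIs such as mailto:
    rawPath != null && rawPath.toLowerCase(Locale.ROOT).contains("%2f")
  }

  // Hypothetical helper: reject such URIs up front rather than decoding them.
  def requireNoEncodedSlash(uri: URI): URI = {
    require(!hasEncodedSlash(uri), s"URI path contains an encoded slash: $uri")
    uri
  }
}
```

Whether rejecting is appropriate depends on whether file names containing "/"
ever need to be supported, which the note in the diff already rules out.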