[
https://issues.apache.org/jira/browse/BEAM-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008833#comment-16008833
]
Luke Cwik edited comment on BEAM-2283 at 5/12/17 10:32 PM:
-----------------------------------------------------------
One proposal is to:
* read().from("string path") represents an unescaped URI with no query or
fragment component, potentially containing glob characters and '\' to escape
glob characters
* add read().from(URI uri) for cases where users need to specify query/fragment
components
Conversion from string to URI would be handled through a double escaping
mechanism to support glob expressions:
* file:/my/path* would represent file:/my/path followed by the glob expression
'*'. This would be converted to the string file:/my/path%2A and then passed to
a URI. FileSystem implementations would need to inspect the URI for escaped
glob expressions.
* file:/my/path\* would represent file:/my/path* (note that this is a file
named path* and not a glob expression). This would be converted to the string
file:/my/path%5C%2A and then passed to a URI. FileSystem implementations would
need to inspect the URI, notice that it is not a glob expression, and treat the
unescaped path segment as a literal.
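A minimal sketch of the string-to-URI step, to make the two cases concrete. It
is not Beam code: the class and method names (GlobEscapeSketch, escapeForUri,
hasGlob) are hypothetical, and also escaping '%' is an added assumption so that
a literal percent sign in the input cannot be mistaken for an escape.

```java
import java.net.URI;

class GlobEscapeSketch {
    // Characters to percent-encode: the glob characters from the proposal,
    // the '\' escape character, and '%' (an assumption of this sketch).
    private static final String SPECIAL = "*?[]\\%";

    /** Hypothetical helper: percent-encodes glob characters and '\'. */
    static String escapeForUri(String path) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('%').append(String.format("%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Hypothetical FileSystem-side check: is there an escaped glob character
     *  that is not preceded by an escaped backslash? */
    static boolean hasGlob(URI uri) {
        String raw = uri.getRawPath();
        int i = 0;
        while (i < raw.length()) {
            if (raw.startsWith("%5C", i)) {
                // Escaped backslash: the next character is a literal, skip both.
                i += 3;
                i += raw.startsWith("%", i) ? 3 : 1;
            } else if (raw.startsWith("%2A", i) || raw.startsWith("%3F", i)
                    || raw.startsWith("%5B", i) || raw.startsWith("%5D", i)) {
                return true;
            } else {
                i++;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String glob = escapeForUri("file:/my/path*");      // file:/my/path%2A
        String literal = escapeForUri("file:/my/path\\*"); // file:/my/path%5C%2A
        System.out.println(glob + " glob=" + hasGlob(URI.create(glob)));
        System.out.println(literal + " glob=" + hasGlob(URI.create(literal)));
    }
}
```

Characters that java.net.URI rejects for other reasons, such as spaces, are
left alone here and would need the usual URI encoding as well.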
It would be important for FileSystem implementations to work on the URI
components and path segments individually, converting them to their own
internal representation and failing if necessary.
Glob characters *, [], and ? would be understood and used by the internals of
Apache Beam, and glob conversion from *, [], ? to internal FileSystem glob
representations would be FileSystem dependent.
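As one illustration of a FileSystem-dependent conversion, a local filesystem
could hand the glob to java.nio's PathMatcher, whose "glob:" syntax happens to
use the same *, ?, [] characters and '\' escape. This is only a sketch under
that assumption: LocalGlobSketch and matchesLocal are hypothetical names, and
POSIX-style paths are assumed.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class LocalGlobSketch {
    /** Hypothetical helper: matches a candidate path against a glob using
     *  the default filesystem's own glob implementation. */
    static boolean matchesLocal(String glob, String candidate) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(candidate));
    }

    public static void main(String[] args) {
        // Unescaped '*' acts as a glob.
        System.out.println(matchesLocal("/my/path*", "/my/path1"));
        // '\*' is a literal asterisk, so only a file named "path*" matches.
        System.out.println(matchesLocal("/my/path\\*", "/my/path1"));
    }
}
```

Other filesystems would need their own translation, for example to a regular
expression or to a service-specific listing API.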
This proposal has the benefits that:
* users have the minimal amount of escaping that they need to do (only escape
the set of glob characters when they want things named with *, [], and ?)
* file:/my/path* is a canonical representation that most users would expect to
represent file:/my/path followed by the glob *
> Consider using actual URIs instead of Strings/ResourceIds in relation to
> FileSystems
> ------------------------------------------------------------------------------------
>
> Key: BEAM-2283
> URL: https://issues.apache.org/jira/browse/BEAM-2283
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core, sdk-java-extensions, sdk-java-gcp, sdk-py
> Reporter: Luke Cwik
>
> We treat things like URIs because we expect them to have a scheme component
> and to be able to resolve a parent/child but fail to treat them as URIs in
> the internal implementation since our string versions don't go through URI
> normalization. This brings up a few issues:
> * The cost of implementing and maintaining ResourceIds instead of having
> users use a standard URI implementation. This would just require FileSystems
> to be able to take a string and give back a URI (to enable them to have
> custom implementations in case they extend the concept of URIs with scheme
> specific extensions).
> * The myriad bugs that will come up because of improper usage of URI-like
> strings and the assumptions associated with them (like
> https://issues.apache.org/jira/browse/BEAM-2277)
> Note that swapping to URIs adds complexity because:
> * Resolving URIs with glob expressions needs to be handled carefully
> * FileSystems may need to implement a complicated type instead of ResourceId.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)