[ 
https://issues.apache.org/jira/browse/BEAM-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008833#comment-16008833
 ] 

Luke Cwik edited comment on BEAM-2283 at 5/12/17 10:32 PM:
-----------------------------------------------------------

One proposal is to:
* have read().from("string path") accept an unescaped URI with no query or 
fragment component, potentially containing glob characters, with '\' used to 
escape glob characters
* add read().from(URI uri) for cases where users need to specify query/fragment 
components

Conversion from string to URI would be handled through a double escaping 
mechanism to support glob expressions:
* file:/my/path* would represent file:/my/path followed by the glob expression 
'*'. This would be converted to the string file:/my/path%2A and then passed to 
a URI. FileSystem implementations would need to inspect the URI for escaped 
glob expressions.
* file:/my/path\* would represent file:/my/path* (note that this is a file 
named path* and not a glob expression). This would be converted to the string 
file:/my/path%5C%2A and then passed to a URI. FileSystem implementations would 
need to inspect the URI, notice that it is not a glob expression, and treat the 
unescaped path segment as a literal.
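
As a rough illustration of the double escaping described above, here is a 
minimal Java sketch. GlobSpecEncoder and toUri are hypothetical names and not 
part of Beam. The sketch assumes that everything before the first ':' is the 
scheme, and it percent-encodes every byte after that except unreserved 
characters and '/', so that '*' becomes %2A and '\' becomes %5C.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Beam API: turns an unescaped path spec into a URI.
final class GlobSpecEncoder {
  // Characters copied through unchanged, in addition to ASCII letters and digits.
  private static final String KEPT = "-._~/";

  static URI toUri(String spec) {
    // Assumption: the first ':' (if any) terminates the scheme.
    int colon = spec.indexOf(':');
    StringBuilder out = new StringBuilder(spec.substring(0, colon + 1));
    for (byte b : spec.substring(colon + 1).getBytes(StandardCharsets.UTF_8)) {
      char c = (char) (b & 0xFF);
      boolean alnum =
          (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
      if (alnum || KEPT.indexOf(c) >= 0) {
        out.append(c);
      } else {
        // Glob characters and the '\' escape are percent-encoded like any other byte.
        out.append(String.format("%%%02X", b & 0xFF));
      }
    }
    return URI.create(out.toString());
  }
}
```

With this sketch, GlobSpecEncoder.toUri("file:/my/path*") yields the URI 
file:/my/path%2A, and the escaped form file:/my/path\* yields 
file:/my/path%5C%2A. Because '?' is also encoded, the result never has a query 
component.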

It would be important for FileSystem implementations to work on the URI 
components and path segments individually, converting each to their own 
internal representation and failing if necessary.
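
One way a FileSystem might work on path segments individually is sketched 
below in Java. UriSegments and its methods are hypothetical names and not part 
of Beam. The sketch splits the raw (still escaped) path on '/', decodes each 
segment separately so that an encoded '/' inside a segment is not mistaken for 
a separator, and then checks whether a segment contains an unescaped glob 
character.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Beam API: inspects a URI one path segment at a time.
final class UriSegments {
  private static final String GLOB_CHARS = "*?[]";

  /** Splits the raw path on '/' and percent-decodes each non-empty segment. */
  static List<String> decodedSegments(URI uri) {
    List<String> segments = new ArrayList<>();
    for (String raw : uri.getRawPath().split("/")) {
      if (!raw.isEmpty()) {
        segments.add(percentDecode(raw));
      }
    }
    return segments;
  }

  /** True if the decoded segment has a glob character not preceded by '\'. */
  static boolean isGlob(String segment) {
    for (int i = 0; i < segment.length(); i++) {
      char c = segment.charAt(i);
      if (c == '\\' && i + 1 < segment.length()
          && GLOB_CHARS.indexOf(segment.charAt(i + 1)) >= 0) {
        i++; // skip the escaped glob character
      } else if (GLOB_CHARS.indexOf(c) >= 0) {
        return true;
      }
    }
    return false;
  }

  /** Removes each '\' that escapes a glob character, leaving the literal name. */
  static String literal(String segment) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < segment.length(); i++) {
      char c = segment.charAt(i);
      if (c == '\\' && i + 1 < segment.length()
          && GLOB_CHARS.indexOf(segment.charAt(i + 1)) >= 0) {
        continue; // drop the backslash; the next character is appended as-is
      }
      out.append(c);
    }
    return out.toString();
  }

  private static String percentDecode(String raw) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    for (int i = 0; i < raw.length(); i++) {
      if (raw.charAt(i) == '%') {
        // java.net.URI has already validated that two hex digits follow.
        bytes.write(Integer.parseInt(raw.substring(i + 1, i + 3), 16));
        i += 2;
      } else {
        int cp = raw.codePointAt(i);
        byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        bytes.write(utf8, 0, utf8.length);
        i += Character.charCount(cp) - 1;
      }
    }
    return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
  }
}
```

For file:/my/path%2A the last decoded segment is path*, which isGlob reports as 
a glob. For file:/my/path%5C%2A the last decoded segment is path\*, which is 
not a glob, and literal returns the file name path*. A real implementation 
would also reject segments it cannot represent, which this sketch leaves out.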

The glob characters *, [], and ? would be understood and used by the internals 
of Apache Beam. Conversion from *, [], and ? to a FileSystem's internal glob 
representation would be FileSystem dependent.
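
To make the FileSystem-dependent conversion concrete, here is a Java sketch of 
one possible target representation: a java.util.regex.Pattern that matches a 
single path segment. GlobToRegex is a hypothetical name and not part of Beam, 
and the choice of a regex is an assumption made only for illustration. The 
sketch treats [..] as a plain character class without negation, and treats '\' 
before a glob character as an escape.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Beam API: one possible internal glob representation.
final class GlobToRegex {
  private static final String GLOB_CHARS = "*?[]";

  static Pattern compile(String glob) {
    StringBuilder re = new StringBuilder();
    int i = 0;
    while (i < glob.length()) {
      char c = glob.charAt(i);
      if (c == '\\' && i + 1 < glob.length()
          && GLOB_CHARS.indexOf(glob.charAt(i + 1)) >= 0) {
        // Escaped glob character: match it literally.
        re.append(Pattern.quote(String.valueOf(glob.charAt(i + 1))));
        i += 2;
      } else if (c == '*') {
        re.append("[^/]*"); // any run of characters within one segment
        i++;
      } else if (c == '?') {
        re.append("[^/]"); // exactly one character within one segment
        i++;
      } else if (c == '[' && glob.indexOf(']', i + 1) > i + 1) {
        // Non-empty character class: copy members, escaping regex-special ones.
        int close = glob.indexOf(']', i + 1);
        re.append('[');
        for (char m : glob.substring(i + 1, close).toCharArray()) {
          if ("\\^[&".indexOf(m) >= 0) {
            re.append('\\');
          }
          re.append(m);
        }
        re.append(']');
        i = close + 1;
      } else {
        re.append(Pattern.quote(String.valueOf(c)));
        i++;
      }
    }
    return Pattern.compile(re.toString());
  }
}
```

Under these assumptions, GlobToRegex.compile("path*") matches path123 but not 
path/x, while GlobToRegex.compile("path\\*") (the Java literal for path\*) 
matches only the name path*. A different FileSystem could translate the same 
three glob forms into its own native listing or matching syntax instead.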

This proposal has the benefits that:
* users have the minimal amount of escaping that they need to do (only escape 
the set of glob characters when they want things named with *, [], and ?)
* file:/my/path* is a canonical representation that most users would expect to 
represent file:/my/path followed by the glob *


> Consider using actual URIs instead of Strings/ResourceIds in relation to 
> FileSystems
> ------------------------------------------------------------------------------------
>
>                 Key: BEAM-2283
>                 URL: https://issues.apache.org/jira/browse/BEAM-2283
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core, sdk-java-extensions, sdk-java-gcp, sdk-py
>            Reporter: Luke Cwik
>
> We treat things like URIs because we expect them to have a scheme component 
> and to be able to resolve a parent/child but fail to treat them as URIs in 
> the internal implementation since our string versions don't go through URI 
> normalization. This brings up a few issues:
> * The cost of implementing and maintaining ResourceIds instead of having 
> users use a standard URI implementation. This would just require FileSystems 
> to be able to take a string and give back a URI (to enable them to have 
> custom implementations in case they extend the concept of URIs with scheme 
> specific extensions).
> * The myriad of bugs that will come up because of improper usage of URI-like 
> strings and the assumptions associated with them (like 
> https://issues.apache.org/jira/browse/BEAM-2277)
> Note that swapping to URIs adds complexity because:
> * Resolving URIs with glob expressions needs to be handled carefully
> * FileSystems may need to implement a complicated type instead of ResourceId.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
