[ https://issues.apache.org/jira/browse/BEAM-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008833#comment-16008833 ]
Luke Cwik edited comment on BEAM-2283 at 5/12/17 10:32 PM:
-----------------------------------------------------------

One proposal is to:
* read().from("string path") represents an unescaped URI with no query or fragment component, potentially containing glob characters and '\' to escape glob characters
* add read().from(URI uri) for cases where users need to specify query/fragment components

Conversion from string to URI would be handled through a double escaping mechanism to support glob expressions:
* file:/my/path* would represent file:/my/path followed by the glob expression '*'. This would be converted to the string file:/my/path%2A and then passed to a URI. FileSystem implementations would need to inspect the URI for escaped glob expressions.
* file:/my/path\* would represent file:/my/path* (note that this is a file named path* and not a glob expression). This would be converted to the string file:/my/path%5C%2A and then passed to a URI. FileSystem implementations would need to inspect the URI, notice that it is not a glob expression, and treat the unescaped path segment as a literal.

It would be important for FileSystem implementations to work on the URI and its components and path segments individually, converting to their own internal representation and failing if necessary. Glob characters *, [], and ? would be understood and used by the internals of Apache Beam, and glob conversion from *, [], ? to internal FileSystem glob representations would be FileSystem dependent.

This proposal has the benefits that:
* users have the minimal amount of escaping that they need to do (only escape the set of glob characters when they want things named with *, [], and ?)
* file:/my/path* is a canonical representation that most users would expect to represent file:/my/path followed by the glob *
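
The string-to-URI conversion proposed in the comment can be illustrated with a minimal sketch. The names GlobUriSketch, toUri and hasGlob are hypothetical and are not part of the Beam API. The sketch also makes assumptions the proposal does not state: a literal '%' is encoded as %25, a backslash that does not precede a glob character is rejected, and only hierarchical URIs such as file:/my/path are handled.

```java
import java.net.URI;

final class GlobUriSketch {
    // Glob characters named in the proposal.
    private static final String GLOB_CHARS = "*?[]";

    // Percent-encodes one ASCII character, e.g. '*' becomes "%2A".
    private static String pct(char c) {
        return String.format("%%%02X", (int) c);
    }

    // Converts a user-facing spec (globs plus '\' escapes) into a URI.
    static URI toUri(String spec) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < spec.length(); i++) {
            char c = spec.charAt(i);
            if (c == '\\') {
                // Assumption: '\' is only legal directly before a glob character.
                if (i + 1 >= spec.length() || GLOB_CHARS.indexOf(spec.charAt(i + 1)) < 0) {
                    throw new IllegalArgumentException("Dangling '\\' at index " + i);
                }
                // Escaped glob character: encode both the backslash and the character.
                out.append(pct('\\')).append(pct(spec.charAt(i + 1)));
                i++;
            } else if (GLOB_CHARS.indexOf(c) >= 0 || c == '%') {
                // Unescaped glob character (or a literal '%', by assumption).
                out.append(pct(c));
            } else {
                out.append(c);
            }
        }
        return URI.create(out.toString());
    }

    private static boolean isGlobCode(String code) {
        for (char g : GLOB_CHARS.toCharArray()) {
            if (pct(g).equals(code)) {
                return true;
            }
        }
        return false;
    }

    // What a FileSystem might do: scan the raw (still encoded) path for
    // encoded glob characters that are not preceded by an encoded backslash.
    static boolean hasGlob(URI uri) {
        String raw = uri.getRawPath(); // null for opaque URIs, which this sketch ignores
        for (int i = 0; i + 2 < raw.length(); i++) {
            if (raw.charAt(i) != '%') {
                continue;
            }
            String code = raw.substring(i, i + 3);
            if (code.equals("%5C")) {
                i += 5; // skip "%5C" and the three-character code it escapes
            } else if (isGlobCode(code)) {
                return true;
            } else {
                i += 2; // skip the rest of an unrelated escape such as "%25"
            }
        }
        return false;
    }

    public static void main(String[] args) {
        URI glob = toUri("file:/my/path*");
        URI literal = toUri("file:/my/path\\*");
        System.out.println(glob + " glob=" + hasGlob(glob));
        System.out.println(literal + " glob=" + hasGlob(literal));
    }
}
```

Scanning the raw path rather than the decoded one matters here: once the URI is decoded, the glob and the literal forms can no longer be told apart by their encoded prefixes.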
> Consider using actual URIs instead of Strings/ResourceIds in relation to FileSystems
> ------------------------------------------------------------------------------------
>
> Key: BEAM-2283
> URL: https://issues.apache.org/jira/browse/BEAM-2283
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core, sdk-java-extensions, sdk-java-gcp, sdk-py
> Reporter: Luke Cwik
>
> We treat things like URIs because we expect them to have a scheme component and to be able to resolve a parent/child but fail to treat them as URIs in the internal implementation since our string versions don't go through URI normalization. This brings up a few issues:
> * The cost of implementing and maintaining ResourceIds instead of having users use a standard URI implementation. This would just require FileSystems to be able to take a string and give back a URI (to enable them to have custom implementations in case they extend the concept of URIs with scheme specific extensions).
> * The myriad of bugs that will come up because of improper usage of URI like strings and the assumptions associated with them (like https://issues.apache.org/jira/browse/BEAM-2277)
> Note that swapping to URIs adds complexity because:
> * Resolving URIs with glob expressions needs to be handled carefully
> * FileSystems may need to implement a complicated type instead of ResourceId.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)