One decision made during the previous design discussion [1] was "Replacing FilePath with URI for resolving file paths". I brought this back to the dev@ mailing list in my previous email.

Comment [2] asked me to clarify the impact on Windows users, because they have to specify paths in URI format, such as:
"file:///C:/home/input-*"
"C:/home/"

Using URIs in the API ensures that Beam code is file-system agnostic.

The alternative is Java Path/File. It is used in the current IOChannelFactory API, and it works poorly. On Windows, for example, Path throws when the path contains a file scheme or an asterisk:
new File("file:///C:/home/").toPath() throws in toPath().
Paths.get("C:/home/").resolve("output-*") throws in resolve().

Any thoughts and suggestions are welcome.

Thanks
--
Pei

---
[1]: https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA30vtPU#heading=h.p3gc3colc2cs
[2]: https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA02O1cY

On Tue, Dec 6, 2016 at 1:25 PM, Kenneth Knowles <k...@google.com.invalid> wrote:

> Thanks for the thorough answers. It all sounds good to me.
>
> On Tue, Dec 6, 2016 at 12:57 PM, Pei He <pe...@google.com.invalid> wrote:
>
> > Thanks Kenn for the feedback and questions.
> >
> > I responded inline.
> >
> > On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles <k...@google.com.invalid>
> > wrote:
> >
> > > I really like this document. It is easy to read and informative. Three
> > > things not addressed by the document:
> > >
> > > 1. Major Beam use cases. I'm sure we have a few in the SDK that could be
> > > outlined in terms of the new API with pseudocode.
> >
> > (I am writing pseudocode directly against the FileSystem interface to
> > demonstrate. However, clients will use the FileSystems utility. This gives
> > us a layer between the file system providers' interface and the client
> > interface. We can add utility functions to FileSystems for common usage
> > patterns as needed.)
> >
> > Major Beam use cases are the following:
> > A. FileBasedSource:
> > // a. Get input URIs and file sizes from user-provided specs.
> > // Note: I updated match() to be a bulk operation after I sent my last
> > // email.
> > List<MatchResult> results = match(specList);
> > List<Metadata> inputMetadataList = FluentIterable.from(results)
> >     .transformAndConcat(
> >         new Function<MatchResult, Iterable<Metadata>>() {
> >           @Override
> >           public Iterable<Metadata> apply(MatchResult result) {
> >             return Arrays.asList(result.metadata());
> >           }
> >         })
> >     .toList();
> >
> > // b. Read from a start offset to support source splitting.
> > SeekableByteChannel seekChannel = open(fileUri);
> > seekChannel.position(source.getStartOffset());
> > seekChannel.read(...);
> >
> > B. FileBasedSink:
> > // Bulk rename temporary files to output files.
> > rename(tempUris, outputUris);
> >
> > C. General file operations:
> > a. Resolve paths.
> > b. Create a file to write, open a file to read (for example, in tests).
> > c. Bulk delete files/directories.
> >
> > > 2. Related work. How does this differ from other filesystem APIs and why?
> >
> > We need three sets of functionality:
> > 1. Resolving paths.
> > 2. Read and write channels.
> > 3. Bulk file management operations (bulk delete/rename/match).
> >
> > They are all available from Java nio, the Hadoop FileSystem APIs, and other
> > standard libraries such as java.net.URI.
> >
> > The current IOChannelFactory interface uses Java nio for (1) and (2), and
> > defines its own interface for (3).
> >
> > In my redesign, I made the following choices:
> > For (1), I replaced Java nio with URI, because it is standardized and
> > precise and doesn't require file system providers to implement an
> > additional Path interface.
> >
> > For (2), I kept the use of Java nio (Writable/SeekableByteChannel), since
> > I don't see anything that needs improving and I don't see any better
> > alternatives (Hadoop's FSDataInput/OutputStream provide the same
> > functionality, but require additional dependencies).
> >
> > For (3), the reasons I didn't choose Java nio or Hadoop are:
> > 1. Beam needs a bulk operations API for better performance, whereas the
> > Java nio and Hadoop FileSystem APIs are single-file based.
> > 2. The APIs should be file-system agnostic. For example, we can use URI
> > instead of Path.
> > 3. The APIs should be minimal and easy for file system providers to
> > implement.
> > 4. It introduces fewer dependencies.
> > 5. It is easy to build an adaptor based on the Java nio or Hadoop
> > interfaces.
> >
> > > 3. Discussion of non-Java languages. It would be good to know what classes
> > > in e.g. Python we might use in place of URI, SeekableByteChannel, etc.
> >
> > I don't want to mislead people here without a thorough investigation. As
> > you can see from your second question, that would require iterations on
> > design and prototyping.
> >
> > I didn't introduce any Java-specific requirements in the redesign.
> > Resolving paths, seeking with channels or streams, and file management
> > operations are language independent. And I'm pretty sure there are Python
> > libraries for them.
> >
> > However, I am happy to hear thoughts and get help from people working on
> > the Python SDK.
> >
> > > On Mon, Dec 5, 2016 at 4:41 PM, Pei He <pe...@google.com.invalid> wrote:
> > >
> > > > I have received a lot of comments on "Part 1: IOChannelFactory
> > > > Redesign" [1], and I have updated the design based on the feedback.
> > > >
> > > > Now I feel it is close to ready for implementation, and I would like
> > > > to summarize the changes:
> > > > 1. Replaced FilePath with URI for resolving file paths.
> > > > 2. Required match(String spec) to handle ambiguities in user-provided
> > > > strings (see the match() javadoc in the design doc for details).
> > > > 3. Changed Metadata to use the Future.get() paradigm, and removed
> > > > exception().
> > > > 4. Changed methods on the FileSystem interface to be protected (visible
> > > > for implementors), and created the FileSystems utility (visible for
> > > > callers).
> > > > 5. Simplified the FileSystem interface by moving operation options,
> > > > such as DeleteOptions and MatchOptions, to the FileSystems utility.
> > > > 6. Simplified the FileSystem interface by requiring certain behaviors,
> > > > such as creating recursively and throwing for missing files.
> > > >
> > > > Any thoughts / feedback?
> > > > --
> > > > Pei
> > > >
> > > > [1]
> > > > https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#
> > > >
> > > > On Wed, Nov 30, 2016 at 1:32 PM, Pei He <pe...@google.com> wrote:
> > > >
> > > > > Thanks JB for the feedback.
> > > > >
> > > > > Yes, we should provide a hadoop.fs.FileSystem adaptor. As you said,
> > > > > it will make a range of file systems available in Beam.
> > > > >
> > > > > And people can choose to implement BeamFileSystem directly to get
> > > > > the best performance (for example, by providing bulk operations).
> > > > >
> > > > > --
> > > > > Pei
> > > > >
> > > > > On Tue, Nov 29, 2016 at 11:11 AM, Jean-Baptiste Onofré
> > > > > <j...@nanthrax.net> wrote:
> > > > >
> > > > >> Hi Pei,
> > > > >>
> > > > >> Rethinking this, I understand that the purpose of the Beam
> > > > >> filesystem is to avoid bringing a bunch of dependencies into the
> > > > >> core. That makes perfect sense.
> > > > >>
> > > > >> So, I agree that a Beam filesystem abstraction is fine.
> > > > >>
> > > > >> My point is that we should provide a HadoopFilesystem
> > > > >> extension/plugin for the Beam filesystem asap: that would help us
> > > > >> support a good range of filesystems quickly.
> > > > >>
> > > > >> Just my $0.01 ;)
> > > > >>
> > > > >> Regards
> > > > >> JB
> > > > >>
> > > > >> On 11/17/2016 08:18 PM, Pei He wrote:
> > > > >>
> > > > >>> Hi JB,
> > > > >>> My proposals are based on the current IOChannelFactory and how it
> > > > >>> is used in FileBasedSink.
> > > > >>>
> > > > >>> Let me spend more time investigating the Hadoop FileSystem
> > > > >>> interface.
> > > > >>> --
> > > > >>> Pei
> > > > >>>
> > > > >>> On Thu, Nov 17, 2016 at 1:21 AM, Jean-Baptiste Onofré
> > > > >>> <j...@nanthrax.net> wrote:
> > > > >>>
> > > > >>>> By the way, Pei, for the record: why introduce BeamFileSystem and
> > > > >>>> not use the Hadoop FileSystem interface?
> > > > >>>>
> > > > >>>> Thanks
> > > > >>>> Regards
> > > > >>>> JB
> > > > >>>>
> > > > >>>> On 11/17/2016 01:09 AM, Pei He wrote:
> > > > >>>>
> > > > >>>>> Hi,
> > > > >>>>>
> > > > >>>>> I am working on BEAM-59
> > > > >>>>> <https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
> > > > >>>>> redesign". The goals are:
> > > > >>>>>
> > > > >>>>> 1. Support file-based IOs (TextIO, AvroIO) with user-defined file
> > > > >>>>> systems.
> > > > >>>>>
> > > > >>>>> 2. Support configuring any user-defined file system.
> > > > >>>>>
> > > > >>>>> And I drafted the design proposal in two parts to address them in
> > > > >>>>> order:
> > > > >>>>>
> > > > >>>>> Part 1: IOChannelFactory Redesign
> > > > >>>>> <https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#>
> > > > >>>>>
> > > > >>>>> Summary:
> > > > >>>>>
> > > > >>>>> Old API: WritableByteChannel create(String spec, String mimeType);
> > > > >>>>>
> > > > >>>>> New API: WritableByteChannel create(URI uri, CreateOptions options);
> > > > >>>>>
> > > > >>>>> Noticeable proposed changes:
> > > > >>>>>
> > > > >>>>> 1. Include the options parameter in most methods to specify
> > > > >>>>> behaviors.
> > > > >>>>> 2. Replace String with URI to include the scheme for
> > > > >>>>> file/directory locations.
> > > > >>>>> 3. Require file systems to provide a SeekableByteChannel for
> > > > >>>>> reads.
> > > > >>>>> 4. Additional methods, such as getMetadata(), rename(), etc.
> > > > >>>>>
> > > > >>>>> Part 2: Configurable BeamFileSystem
> > > > >>>>> <https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>
> > > > >>>>>
> > > > >>>>> Summary:
> > > > >>>>>
> > > > >>>>> Old API: IOChannelUtils.getFactory(glob).match(glob);
> > > > >>>>>
> > > > >>>>> New API: BeamFileSystems.getFileSystem(glob, config).match(glob);
> > > > >>>>>
> > > > >>>>> Looking for comments and feedback.
> > > > >>>>>
> > > > >>>>> Thanks
> > > > >>>>>
> > > > >>>>> --
> > > > >>>>>
> > > > >>>>> Pei
> > > > >>>>
> > > > >>>> --
> > > > >>>> Jean-Baptiste Onofré
> > > > >>>> jbono...@apache.org
> > > > >>>> http://blog.nanthrax.net
> > > > >>>> Talend - http://www.talend.com
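
As a footnote to the Windows path question at the top of this thread, here is a minimal sketch of resolving an output glob against a directory URI. It uses only java.net.URI from the JDK. The class name UriResolveSketch and the helper resolve() are hypothetical names chosen for illustration and are not part of the proposed Beam API.

```java
import java.net.URI;

public class UriResolveSketch {

  /**
   * Resolves a relative name (which may contain a glob character such as '*')
   * against a directory URI. Hypothetical helper, for illustration only.
   */
  public static URI resolve(String directory, String relative) {
    // URI.create() parses the string; resolve() applies standard
    // relative-reference resolution and does not touch the local file system.
    return URI.create(directory).resolve(relative);
  }

  public static void main(String[] args) {
    URI output = resolve("file:///C:/home/", "output-*");
    System.out.println(output.getScheme()); // file
    System.out.println(output.getPath());   // /C:/home/output-*
  }
}
```

Note that the directory URI needs its trailing slash. Resolving "output-*" against "file:///C:/home" replaces the last path segment and yields the path "/C:/output-*", which is probably not what a caller wants.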