Thanks Kenn for the feedback and questions. I responded inline.
On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles <k...@google.com.invalid> wrote:

> I really like this document. It is easy to read and informative. Three
> things not addressed by the document:
>
> 1. Major Beam use cases. I'm sure we have a few in the SDK that could be
> outlined in terms of the new API with pseudocode.

(I am writing the pseudocode directly against the FileSystem interface to
demonstrate. Clients, however, will use the FileSystems utility, which gives
us a layer between the file system providers' interface and the client
interface. We can add utility functions to FileSystems for common usage
patterns as needed.)

The major Beam use cases are the following:

A. FileBasedSource:

    // a. Get input URIs and file sizes from user-provided specs.
    // Note: I updated match() to be a bulk operation after I sent my last email.
    List<MatchResult> results = match(specList);
    List<Metadata> inputMetadataList = FluentIterable.from(results)
        .transformAndConcat(
            new Function<MatchResult, Iterable<Metadata>>() {
              @Override
              public Iterable<Metadata> apply(MatchResult result) {
                return Arrays.asList(result.metadata());
              }
            })
        .toList();

    // b. Read from a start offset to support source splitting.
    SeekableByteChannel seekChannel = open(fileUri);
    seekChannel.position(source.getStartOffset());
    seekChannel.read(...);

B. FileBasedSink:

    // Bulk rename temporary files to output files.
    rename(tempUris, outputUris);

C. General file operations:
   a. Resolve paths.
   b. Create a file to write, open a file to read (for example, in tests).
   c. Bulk delete files/directories.

> 2. Related work. How does this differ from other filesystem APIs and why?

We need three sets of functionality:
1. Resolving paths.
2. Read and write channels.
3. Bulk file management operations (bulk delete/rename/match).

These are available from Java nio, the Hadoop FileSystem APIs, and other
standard libraries such as java.net.URI.

The current IOChannelFactory interface uses Java nio for (1) and (2), and
defines its own interface for (3).
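To make (1) and (2) concrete, here is a minimal sketch that uses only the
standard library: java.net.URI for resolving paths and a SeekableByteChannel
for reading from an offset. The class and method names (ChannelSketch,
resolve, readFrom) are hypothetical and are not part of the proposed API, and
the sketch only handles local files. The bulk operations in (3) are not
covered, since neither library offers them.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; not the FileSystem interface from the design doc.
public class ChannelSketch {

  // (1) Resolve a child name against a directory URI using only java.net.URI.
  // The directory URI is assumed to end with a slash.
  static URI resolve(URI directory, String child) {
    return directory.resolve(child);
  }

  // (2) Open a seekable channel, jump to startOffset, and read up to length bytes.
  // Assumes a "file" scheme URI, since Paths.get(URI) needs an installed provider.
  static String readFrom(URI fileUri, long startOffset, int length) throws IOException {
    try (SeekableByteChannel channel =
        Files.newByteChannel(Paths.get(fileUri), StandardOpenOption.READ)) {
      channel.position(startOffset);
      ByteBuffer buffer = ByteBuffer.allocate(length);
      while (buffer.hasRemaining() && channel.read(buffer) != -1) {
        // Keep reading until the buffer is full or the end of the file is reached.
      }
      buffer.flip();
      return StandardCharsets.UTF_8.decode(buffer).toString();
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("sketch");
    URI fileUri = resolve(dir.toUri(), "input.txt");
    Files.write(Paths.get(fileUri), "hello world".getBytes(StandardCharsets.UTF_8));
    System.out.println(readFrom(fileUri, 6, 5)); // prints "world"
  }
}
```

Reading from a start offset like this is what FileBasedSource needs for
splitting: each split positions the channel at its own offset and reads its
own range.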
In my redesign, I made the following choices:

For (1), I replaced Java nio with URI, because it is standardized and precise,
and it doesn't require file system providers to implement an additional Path
interface.

For (2), I kept the use of Java nio (Writable/SeekableByteChannel), since I
don't see anything that needs improvement and I don't see any better
alternatives (Hadoop's FSDataInput/OutputStream provide the same
functionality, but require additional dependencies).

For (3), the reasons I didn't choose Java nio or Hadoop are:
1. Beam needs a bulk operations API for better performance, whereas the Java
   nio and Hadoop FileSystem APIs are single-file based.
2. We want APIs that are file system agnostic. For example, we can use URI
   instead of Path.
3. We want APIs that are minimal and easy for file system providers to
   implement.
4. It introduces fewer dependencies.
5. It is easy to build an adaptor based on the Java nio or Hadoop interfaces.

> 3. Discussion of non-Java languages. It would be good to know what classes
> in e.g. Python we might use in place of URI, SeekableByteChannel, etc.

I don't want to mislead people here without a thorough investigation. As you
can see from your second question, that would require iterations on design and
prototyping.

I didn't introduce any Java-specific requirements in the redesign. Resolving
paths, seeking with channels or streams, and file management operations are
language independent. And I am pretty sure there are Python libraries for
that.

However, I am happy to hear thoughts and get help from people working on the
Python SDK.

> On Mon, Dec 5, 2016 at 4:41 PM, Pei He <pe...@google.com.invalid> wrote:
>
> > I have received a lot of comments in "Part 1: IOChannelFactory
> > Redesign" [1]. And, I have updated the design based on the feedback.
> >
> > Now, I feel it is close to being ready for implementation, and I would
> > like to summarize the changes:
> > 1. Replaced FilePath with URI for resolving file paths.
> > 2. Required match(String spec) to handle ambiguities in user-provided
> > strings (see the match() java doc in the design doc for details).
> > 3. Changed Metadata to use Future.get() paradigm, and removed exception().
> > 4. Changed methods on FileSystem interface to be protected (visible for
> > implementors), and created FileSystems utility (visible for callers).
> > 5. Simplified FileSystem interface by moving operation options, such as
> > DeleteOptions, MatchOptions, to the FileSystems utility.
> > 6. Simplified FileSystem interface by requiring certain behaviors, such as
> > creating recursively, throwing for missing files.
> >
> > Any thoughts / feedback?
> > --
> > Pei
> >
> > [1]
> > https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#
> >
> > On Wed, Nov 30, 2016 at 1:32 PM, Pei He <pe...@google.com> wrote:
> >
> > > Thanks JB for the feedback.
> > >
> > > Yes, we should provide a hadoop.fs.FileSystem adaptor. As you said, it
> > > will make a range of file systems available in Beam.
> > >
> > > And, people can choose to implement BeamFileSystem directly to get the
> > > best performance (for example, providing bulk operations).
> > >
> > > --
> > > Pei
> > >
> > > On Tue, Nov 29, 2016 at 11:11 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > > wrote:
> > >
> > >> Hi Pei,
> > >>
> > >> rethinking about that, I understand that the purpose of the Beam
> > >> filesystem is to avoid to bring a bunch of dependencies into the core.
> > >> That makes perfect sense.
> > >>
> > >> So, I agree that a Beam filesystem abstract is fine.
> > >>
> > >> My point is that we should provide a HadoopFilesystem extension/plugin
> > >> for Beam filesystem asap: that would help us to support a good range of
> > >> filesystems quickly.
> > >>
> > >> Just my $0.01 ;)
> > >>
> > >> Regards
> > >> JB
> > >>
> > >> On 11/17/2016 08:18 PM, Pei He wrote:
> > >>
> > >>> Hi JB,
> > >>> My proposals are based on the current IOChannelFactory, and how they
> > >>> are used in FileBasedSink.
> > >>>
> > >>> Let me spend more time to investigate Hadoop FileSystem interface.
> > >>> --
> > >>> Pei
> > >>>
> > >>> On Thu, Nov 17, 2016 at 1:21 AM, Jean-Baptiste Onofré
> > >>> <j...@nanthrax.net> wrote:
> > >>>
> > >>>> By the way, Pei, for the record: why introducing BeamFileSystem and
> > >>>> not using the Hadoop FileSystem interface ?
> > >>>>
> > >>>> Thanks
> > >>>> Regards
> > >>>> JB
> > >>>>
> > >>>> On 11/17/2016 01:09 AM, Pei He wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> I am working on BEAM-59
> > >>>>> <https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
> > >>>>> redesign". The goals are:
> > >>>>>
> > >>>>> 1. Support file-based IOs (TextIO, AvroIO) with user-defined file
> > >>>>> system.
> > >>>>>
> > >>>>> 2. Support configuring any user-defined file system.
> > >>>>>
> > >>>>> And, I drafted the design proposal in two parts to address them in
> > >>>>> order:
> > >>>>>
> > >>>>> Part 1: IOChannelFactory Redesign
> > >>>>> <https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#>
> > >>>>>
> > >>>>> Summary:
> > >>>>>
> > >>>>> Old API: WritableByteChannel create(String spec, String mimeType);
> > >>>>>
> > >>>>> New API: WritableByteChannel create(URI uri, CreateOptions options);
> > >>>>>
> > >>>>> Noticeable proposed changes:
> > >>>>>
> > >>>>> 1. Includes the options parameter in most methods to specify
> > >>>>> behaviors.
> > >>>>> 2. Replace String with URI to include scheme for files/directories
> > >>>>> locations.
> > >>>>> 3. Require file systems to provide a SeekableByteChannel for read.
> > >>>>> 4. Additional methods, such as getMetadata(), rename(), etc.
> > >>>>>
> > >>>>> Part 2: Configurable BeamFileSystem
> > >>>>> <https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>
> > >>>>>
> > >>>>> Summary:
> > >>>>>
> > >>>>> Old API: IOChannelUtils.getFactory(glob).match(glob);
> > >>>>>
> > >>>>> New API: BeamFileSystems.getFileSystem(glob, config).match(glob);
> > >>>>>
> > >>>>> Looking for comments and feedback.
> > >>>>>
> > >>>>> Thanks
> > >>>>>
> > >>>>> --
> > >>>>>
> > >>>>> Pei
> > >>>>>
> > >>>> --
> > >>>> Jean-Baptiste Onofré
> > >>>> jbono...@apache.org
> > >>>> http://blog.nanthrax.net
> > >>>> Talend - http://www.talend.com
> > >>>>
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbono...@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com