How can I unsubscribe? I will be away from this subject for some time and will rejoin once I get back.

Thanks, colleagues. Happy holidays.
> On Dec 13, 2016, at 12:34 PM, Pei He <pe...@google.com.INVALID> wrote:
>
> One design decision made during the previous design discussion [1] is
> "Replacing FilePath with URI for resolving files paths". This was brought
> back to the dev@ mailing list in my previous email.
>
> Comment [2] asked me to clarify the impact on Windows users, because
> users have to specify paths in URI format, such as:
> "file:///C:/home/input-*"
> "C:/home/"
>
> Using URIs in the API ensures that Beam code is file-system agnostic.
>
> The alternative is Java Path/File. It is used in the current
> IOChannelFactory API, and it works poorly. For example, Path throws when
> the path contains a file scheme or an asterisk:
> new File("file:///C:/home/").toPath() throws in toPath().
> Paths.get("C:/home/").resolve("output-*") throws in resolve().
>
> Any thoughts and suggestions are welcome.
>
> Thanks
> --
> Pei
>
> ---
> [1]:
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA30vtPU#heading=h.p3gc3colc2cs
>
> [2]:
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA02O1cY
>
> On Tue, Dec 6, 2016 at 1:25 PM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
>> Thanks for the thorough answers. It all sounds good to me.
>>
>>> On Tue, Dec 6, 2016 at 12:57 PM, Pei He <pe...@google.com.invalid> wrote:
>>>
>>> Thanks Kenn for the feedback and questions.
>>>
>>> I responded inline.
>>>
>>> On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles <k...@google.com.invalid>
>>> wrote:
>>>
>>>> I really like this document. It is easy to read and informative. Three
>>>> things not addressed by the document:
>>>>
>>>> 1. Major Beam use cases. I'm sure we have a few in the SDK that could be
>>>> outlined in terms of the new API with pseudocode.
>>>
>>> (I am writing pseudocode directly with the FileSystem interface to demonstrate.
>>> However, clients will use the FileSystems utility. This gives us a
>>> layer between the file system providers' interface and the client
>>> interface. We can add utility functions to FileSystems for common use
>>> patterns as needed.)
>>>
>>> The major Beam use cases are the following:
>>> A. FileBasedSource:
>>> // a. Get input URIs and file sizes from user-provided specs.
>>> // Note: I updated match() to be a bulk operation after I sent my last
>>> // email.
>>> List<MatchResult> results = match(specList);
>>> List<Metadata> inputMetadataList = FluentIterable.from(results)
>>>     .transformAndConcat(
>>>         new Function<MatchResult, Iterable<Metadata>>() {
>>>           @Override
>>>           public Iterable<Metadata> apply(MatchResult result) {
>>>             return Arrays.asList(result.metadata());
>>>           }
>>>         })
>>>     .toList();
>>>
>>> // b. Read from a start offset to support source splitting.
>>> SeekableByteChannel seekChannel = open(fileUri);
>>> seekChannel.position(source.getStartOffset());
>>> seekChannel.read(...);
>>>
>>> B. FileBasedSink:
>>> // Bulk rename temporary files to output files.
>>> rename(tempUris, outputUris);
>>>
>>> C. General file operations:
>>> a. Resolve paths.
>>> b. Create a file to write, open a file to read (for example, in tests).
>>> c. Bulk delete files/directories.
>>>
>>>> 2. Related work. How does this differ from other filesystem APIs and why?
>>>
>>> We need three sets of functionality:
>>> 1. Resolve paths.
>>> 2. Read and write channels.
>>> 3. Bulk file management operations (bulk delete/rename/match).
>>>
>>> They are available from Java nio, the Hadoop FileSystem APIs, and other
>>> standard libraries such as java.net.URI.
>>>
>>> The current IOChannelFactory interface uses Java nio for (1) and (2), and
>>> defines its own interface for (3).
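
Editorial sketch (not part of the original message): functionalities (1) and (2) above can be tried with the JDK alone. The class and method names below are made up for illustration and are not part of Beam or of the proposed FileSystem interface; the example only shows java.net.URI resolving a name that contains an asterisk, and a SeekableByteChannel read that starts at an offset.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ResolveAndSeekSketch {

  /** (1) Resolve a child name against a directory URI; no file system access is involved. */
  public static URI resolve(String directoryUri, String child) {
    return URI.create(directoryUri).resolve(child);
  }

  /** (2) Read up to maxBytes from a local file, starting at startOffset. */
  public static String readFrom(Path file, long startOffset, int maxBytes) throws IOException {
    try (SeekableByteChannel channel = Files.newByteChannel(file, StandardOpenOption.READ)) {
      channel.position(startOffset);
      ByteBuffer buffer = ByteBuffer.allocate(maxBytes);
      // A single read() may return fewer bytes than requested, so loop until full or EOF.
      while (buffer.hasRemaining() && channel.read(buffer) != -1) {
        // keep reading
      }
      buffer.flip();
      return StandardCharsets.UTF_8.decode(buffer).toString();
    }
  }

  public static void main(String[] args) throws IOException {
    // The asterisk is a legal URI character, so resolving a glob-like name does not throw.
    URI output = resolve("file:///C:/home/", "output-*");
    System.out.println(output.getScheme() + " " + output.getPath());

    Path temp = Files.createTempFile("seek-sketch", ".txt");
    try {
      Files.write(temp, "0123456789".getBytes(StandardCharsets.UTF_8));
      System.out.println(readFrom(temp, 4, 3));
    } finally {
      Files.delete(temp);
    }
  }
}
```

The read loop is there because a channel is allowed to return fewer bytes than the buffer can hold; a source that splits on byte offsets would need the same care.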
>>>
>>> In my redesign, I made the following choices:
>>> For (1), I replaced Java nio with URI, because it is standardized and
>>> precise, and it doesn't require file system providers to implement a
>>> Path interface.
>>>
>>> For (2), I kept Java nio (Writable/SeekableByteChannel), since I don't
>>> see anything that needs to improve and I don't see any better
>>> alternatives (Hadoop's FSDataInput/OutputStream provide the same
>>> functionality, but require additional dependencies).
>>>
>>> For (3), the reasons I didn't choose Java nio or Hadoop are:
>>> 1. Beam needs a bulk operations API for better performance, whereas the
>>> Java nio and Hadoop FileSystem APIs are single-file based.
>>> 2. The APIs should be file-system agnostic. For example, we can use URI
>>> instead of Path.
>>> 3. The APIs should be minimal and easy for file system providers to
>>> implement.
>>> 4. It introduces fewer dependencies.
>>> 5. It is easy to build an adaptor based on the Java nio or Hadoop
>>> interfaces.
>>>
>>>> 3. Discussion of non-Java languages. It would be good to know what classes
>>>> in e.g. Python we might use in place of URI, SeekableByteChannel, etc.
>>>
>>> I don't want to mislead people here without a thorough investigation. As
>>> you can see from your second question, that would require iterations on
>>> design and prototyping.
>>>
>>> I didn't introduce any Java-specific requirements in the redesign.
>>> Resolving paths, seeking with channels or streams, and file management
>>> operations are language independent. And I'm pretty sure there are Python
>>> libraries for that.
>>>
>>> However, I am happy to hear thoughts and get help from people working on
>>> the Python SDK.
>>>
>>>> On Mon, Dec 5, 2016 at 4:41 PM, Pei He <pe...@google.com.invalid> wrote:
>>>>
>>>>> I have received a lot of comments in "Part 1: IOChannelFactory
>>>>> Redesign" [1], and I have updated the design based on the feedback.
>>>>>
>>>>> Now, I feel it is close to ready for implementation, and I would like
>>>>> to summarize the changes:
>>>>> 1. Replaced FilePath with URI for resolving file paths.
>>>>> 2. Required match(String spec) to handle ambiguities in user-provided
>>>>> strings (see the match() javadoc in the design doc for details).
>>>>> 3. Changed Metadata to use the Future.get() paradigm, and removed
>>>>> exception().
>>>>> 4. Changed methods on the FileSystem interface to be protected (visible
>>>>> to implementors), and created a FileSystems utility (visible to callers).
>>>>> 5. Simplified the FileSystem interface by moving operation options, such
>>>>> as DeleteOptions and MatchOptions, to the FileSystems utility.
>>>>> 6. Simplified the FileSystem interface by requiring certain behaviors,
>>>>> such as creating recursively and throwing for missing files.
>>>>>
>>>>> Any thoughts / feedback?
>>>>> --
>>>>> Pei
>>>>>
>>>>> [1]
>>>>> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#
>>>>>
>>>>>> On Wed, Nov 30, 2016 at 1:32 PM, Pei He <pe...@google.com> wrote:
>>>>>>
>>>>>> Thanks JB for the feedback.
>>>>>>
>>>>>> Yes, we should provide a hadoop.fs.FileSystem adaptor. As you said, it
>>>>>> will make a range of file systems available in Beam.
>>>>>>
>>>>>> And people can choose to implement BeamFileSystem directly to get the
>>>>>> best performance (for example, by providing bulk operations).
>>>>>>
>>>>>> --
>>>>>> Pei
>>>>>>
>>>>>> On Tue, Nov 29, 2016 at 11:11 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Pei,
>>>>>>>
>>>>>>> Rethinking this, I understand that the purpose of the Beam filesystem
>>>>>>> is to avoid bringing a bunch of dependencies into the core. That
>>>>>>> makes perfect sense.
>>>>>>>
>>>>>>> So, I agree that a Beam filesystem abstraction is fine.
>>>>>>>
>>>>>>> My point is that we should provide a HadoopFilesystem extension/plugin
>>>>>>> for the Beam filesystem asap: that would help us support a good range
>>>>>>> of filesystems quickly.
>>>>>>>
>>>>>>> Just my $0.01 ;)
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>>> On 11/17/2016 08:18 PM, Pei He wrote:
>>>>>>>>
>>>>>>>> Hi JB,
>>>>>>>> My proposals are based on the current IOChannelFactory and how it is
>>>>>>>> used in FileBasedSink.
>>>>>>>>
>>>>>>>> Let me spend more time investigating the Hadoop FileSystem interface.
>>>>>>>> --
>>>>>>>> Pei
>>>>>>>>
>>>>>>>> On Thu, Nov 17, 2016 at 1:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> By the way, Pei, for the record: why introduce BeamFileSystem and not
>>>>>>>>> use the Hadoop FileSystem interface?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>>
>>>>>>>>> On 11/17/2016 01:09 AM, Pei He wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am working on BEAM-59
>>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
>>>>>>>>>> redesign". The goals are:
>>>>>>>>>>
>>>>>>>>>> 1. Support file-based IOs (TextIO, AvroIO) with a user-defined file
>>>>>>>>>> system.
>>>>>>>>>>
>>>>>>>>>> 2. Support configuring any user-defined file system.
>>>>>>>>>>
>>>>>>>>>> I drafted the design proposal in two parts to address them in
>>>>>>>>>> order:
>>>>>>>>>>
>>>>>>>>>> Part 1: IOChannelFactory Redesign
>>>>>>>>>> <https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#>
>>>>>>>>>>
>>>>>>>>>> Summary:
>>>>>>>>>>
>>>>>>>>>> Old API: WritableByteChannel create(String spec, String mimeType);
>>>>>>>>>>
>>>>>>>>>> New API: WritableByteChannel create(URI uri, CreateOptions options);
>>>>>>>>>>
>>>>>>>>>> Noticeable proposed changes:
>>>>>>>>>>
>>>>>>>>>> 1. Include the options parameter in most methods to specify
>>>>>>>>>> behaviors.
>>>>>>>>>> 2. Replace String with URI to include the scheme for file/directory
>>>>>>>>>> locations.
>>>>>>>>>> 3. Require file systems to provide a SeekableByteChannel for read.
>>>>>>>>>> 4. Additional methods, such as getMetadata(), rename(), etc.
>>>>>>>>>>
>>>>>>>>>> Part 2: Configurable BeamFileSystem
>>>>>>>>>> <https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>
>>>>>>>>>>
>>>>>>>>>> Summary:
>>>>>>>>>>
>>>>>>>>>> Old API: IOChannelUtils.getFactory(glob).match(glob);
>>>>>>>>>>
>>>>>>>>>> New API: BeamFileSystems.getFileSystem(glob, config).match(glob);
>>>>>>>>>>
>>>>>>>>>> Looking for comments and feedback.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Pei
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>> jbono...@apache.org
>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>> Talend - http://www.talend.com
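
Editorial sketch (not part of the original thread): the "New API" line in Part 2 implies that a file system is looked up before match() is called. One plausible way to do that lookup is by URI scheme, which is also why the proposal wants locations to carry a scheme. Everything below is hypothetical: FileSystemLookupSketch and SketchFileSystem are illustrative names, not Beam classes, and defaulting a missing scheme to "file" is an assumption rather than something the thread specifies.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class FileSystemLookupSketch {

  /** Hypothetical stand-in for a provider-facing file system; not a Beam type. */
  public interface SketchFileSystem {
    String name();
  }

  private final Map<String, SketchFileSystem> byScheme = new HashMap<>();

  /** Registers a file system for a scheme; schemes are compared case-insensitively. */
  public void register(String scheme, SketchFileSystem fileSystem) {
    byScheme.put(scheme.toLowerCase(Locale.ROOT), fileSystem);
  }

  /** Picks the file system registered for the URI's scheme. */
  public SketchFileSystem lookup(URI uri) {
    // Assumption: a URI without a scheme refers to the local file system.
    String scheme =
        uri.getScheme() == null ? "file" : uri.getScheme().toLowerCase(Locale.ROOT);
    SketchFileSystem fileSystem = byScheme.get(scheme);
    if (fileSystem == null) {
      throw new IllegalArgumentException("No file system registered for scheme: " + scheme);
    }
    return fileSystem;
  }

  public static void main(String[] args) {
    FileSystemLookupSketch registry = new FileSystemLookupSketch();
    registry.register("file", () -> "local");
    registry.register("gs", () -> "gcs");
    System.out.println(registry.lookup(URI.create("gs://bucket/input-*")).name());
    System.out.println(registry.lookup(URI.create("/home/input-*")).name());
  }
}
```

The sketch takes a URI rather than a raw string on purpose: turning a user-provided spec into a URI is where the ambiguities discussed in the thread would have to be handled, and that step is left out here.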