How can I unsubscribe?
I will be away from this subject for some time.
I will rejoin once I get back.
Thanks, colleagues.
Happy holidays!

> On Dec 13, 2016, at 12:34 PM, Pei He <pe...@google.com.INVALID> wrote:
> 
> One design decision made during the previous design discussion [1] is "Replacing
> FilePath with URI for resolving file paths". I brought this back to the
> dev@ mailing list in my previous email.
> 
> Comment [2] asked me to clarify the impact on Windows users, because they
> would have to specify paths in URI format, such as:
> "file:///C:/home/input-*"
> "C:/home/"
> 
> The point of using URIs in the API is to keep Beam code file-system agnostic.
> 
> Another alternative is Java Path/File. It is used in the current
> IOChannelFactory API, and it works poorly. For example, on Windows, Path
> throws when the path contains a file scheme or an asterisk:
> new File("file:///C:/home/").toPath() throws in toPath().
> Paths.get("C:/home/").resolve("output-*") throws in resolve().
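A minimal Java sketch of the URI-based alternative described above, assuming plain java.net.URI is used for resolution. The class name UriResolveSketch and its resolve() helper are hypothetical and are not part of the Beam API. It also illustrates why a bare "C:/home/" needs care when treated as a URI: the drive letter is parsed as the scheme.

```java
import java.net.URI;

final class UriResolveSketch {
    // Resolve a child name (possibly a glob such as "output-*") against a
    // directory URI. Both '*' and ':' are legal here, unlike in a Windows Path.
    static URI resolve(String directory, String child) {
        return URI.create(directory).resolve(child);
    }

    public static void main(String[] args) {
        URI output = resolve("file:///C:/home/", "output-*");
        System.out.println(output.getScheme() + " -> " + output.getPath());

        // Without an explicit scheme, the drive letter is read as the scheme.
        URI bare = URI.create("C:/home/");
        System.out.println(bare.getScheme() + " -> " + bare.getPath());
    }
}
```

This runs the same way on any operating system, because java.net.URI does no platform-specific path validation.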
> 
> Any thoughts and suggestions are welcome.
> 
> Thanks
> --
> Pei
> 
> ---
> [1]:
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA30vtPU#heading=h.p3gc3colc2cs
> 
> [2]:
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA02O1cY
> 
> On Tue, Dec 6, 2016 at 1:25 PM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
> 
>> Thanks for the thorough answers. It all sounds good to me.
>> 
>>> On Tue, Dec 6, 2016 at 12:57 PM, Pei He <pe...@google.com.invalid> wrote:
>>> 
>>> Thanks Kenn for the feedback and questions.
>>> 
>>> I responded inline.
>>> 
>>> On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles <k...@google.com.invalid>
>>> wrote:
>>> 
>>>> I really like this document. It is easy to read and informative. Three
>>>> things not addressed by the document:
>>>> 
>>>> 1. Major Beam use cases. I'm sure we have a few in the SDK that could be
>>>> outlined in terms of the new API with pseudocode.
>>> 
>>> 
>>> (I am writing the pseudocode directly against the FileSystem interface to
>>> demonstrate. However, clients will use the FileSystems utility. This gives
>>> us a layer between the file system providers' interface and the client
>>> interface. We can add utility functions to FileSystems for common usage
>>> patterns as needed.)
>>> 
>>> The major Beam use cases are the following:
>>> A. FileBasedSource:
>>> // a. Get input URIs and file sizes from user-provided specs.
>>> // Note: I updated match() to be a bulk operation after I sent my last
>>> // email.
>>> List<MatchResult> results = match(specList);
>>> List<Metadata> inputMetadataList = FluentIterable.from(results)
>>>    .transformAndConcat(
>>>        new Function<MatchResult, Iterable<Metadata>>() {
>>>          @Override
>>>          public Iterable<Metadata> apply(MatchResult result) {
>>>            return Arrays.asList(result.metadata());
>>>          }
>>>        })
>>>    .toList();
>>> 
>>> // b. Read from a start offset to support source splitting.
>>> SeekableByteChannel seekChannel = open(fileUri);
>>> seekChannel.position(source.getStartOffset());
>>> seekChannel.read(...);
>>> 
>>> B. FileBasedSink:
>>> // bulk rename temporary files to output files
>>> rename(tempUris, outputUris);
>>> 
>>> C. General file operations:
>>> a. resolve paths
>>> b. create a file to write, open a file to read (for example, in tests)
>>> c. bulk delete files/directories
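As a rough illustration of use case A.b above (reading from a start offset to support splitting), here is a small sketch that uses only the standard java.nio channels on a local file. The class SeekReadSketch and the method readRange() are hypothetical names chosen for this example; the proposed open(fileUri) call is replaced by Files.newByteChannel() as an assumption so that the sketch runs on its own.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class SeekReadSketch {
    // Read up to `length` bytes starting at `startOffset`, the way a reader
    // for one split of a file-based source might.
    static byte[] readRange(Path file, long startOffset, int length) throws IOException {
        try (SeekableByteChannel channel = Files.newByteChannel(file, StandardOpenOption.READ)) {
            channel.position(startOffset);
            ByteBuffer buffer = ByteBuffer.allocate(length);
            // Keep reading until the buffer is full or end of file is reached.
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // Intentionally empty: read() advances the buffer position.
            }
            buffer.flip();
            byte[] result = new byte[buffer.remaining()];
            buffer.get(result);
            return result;
        }
    }

    public static void main(String[] args) throws IOException {
        Path temp = Files.createTempFile("seek-sketch", ".txt");
        try {
            Files.write(temp, "0123456789".getBytes(StandardCharsets.US_ASCII));
            byte[] range = readRange(temp, 4, 3);
            System.out.println(new String(range, StandardCharsets.US_ASCII));
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```

A range that extends past the end of the file simply returns fewer bytes than requested.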
>>> 
>>> 
>>> 
>>>> 2. Related work. How does this differ from other filesystem APIs and why?
>>> 
>>> We need three sets of functionality:
>>> 1. resolve paths.
>>> 2. read and write channels.
>>> 3. bulk file management operations (bulk delete/rename/match).
>>> 
>>> These are available from Java NIO, the Hadoop FileSystem APIs, and other
>>> standard libraries such as java.net.URI.
>>> 
>>> The current IOChannelFactory interface uses Java NIO for (1) and (2), and
>>> defines its own interface for (3).
>>> 
>>> In my redesign, I made the following choices:
>>> For (1), I replaced Java NIO with URI, because it is standardized and
>>> precise, and it doesn't require file system providers to implement a Path
>>> interface.
>>> 
>>> For (2), I kept the Java NIO channels (Writable/SeekableByteChannel), since
>>> I don't see anything that needs improving and I don't see any better
>>> alternatives (Hadoop's FSDataInput/OutputStream provide the same
>>> functionality, but require additional dependencies).
>>> 
>>> For (3), the reasons I didn't choose Java NIO or Hadoop are:
>>> 1. Beam needs a bulk operations API for better performance, whereas the
>>> Java NIO and Hadoop FileSystem APIs are single-file based.
>>> 2. The APIs should be file-system agnostic. For example, we can use URI
>>> instead of Path.
>>> 3. The APIs should be minimal and easy for file system providers to
>>> implement.
>>> 4. It introduces fewer dependencies.
>>> 5. It is easy to build an adaptor based on the Java NIO or Hadoop interfaces.
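To make point 5 concrete, here is one possible shape of a bulk rename adaptor built on java.nio.file for local files. This is only a sketch under stated assumptions: the class NioBulkRenameSketch and the rename(List, List) signature are hypothetical, it takes Path arguments rather than the proposed URIs, and it moves files one at a time, which is exactly the single-file behavior a native bulk implementation would try to improve on.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

final class NioBulkRenameSketch {
    // Hypothetical bulk rename: moves sources.get(i) to destinations.get(i).
    // Parent directories of each destination are created if missing.
    static void rename(List<Path> sources, List<Path> destinations) throws IOException {
        if (sources.size() != destinations.size()) {
            throw new IllegalArgumentException("sources and destinations differ in size");
        }
        for (int i = 0; i < sources.size(); i++) {
            Path destination = destinations.get(i);
            Path parent = destination.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.move(sources.get(i), destination, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rename-sketch");
        Path temp0 = Files.write(dir.resolve("temp-0"), new byte[] {1});
        Path temp1 = Files.write(dir.resolve("temp-1"), new byte[] {2});
        rename(
            Arrays.asList(temp0, temp1),
            Arrays.asList(dir.resolve("out/output-0"), dir.resolve("out/output-1")));
        System.out.println(Files.exists(dir.resolve("out/output-0")));
    }
}
```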
>>> 
>>>> 3. Discussion of non-Java languages. It would be good to know what classes
>>>> in e.g. Python we might use in place of URI, SeekableByteChannel, etc.
>>> 
>>> I don't want to mislead people here without a thorough investigation. As
>>> you can see from your second question, that would require iterations on
>>> design and prototyping.
>>> 
>>> I didn't introduce any Java-specific requirements in the redesign.
>>> Resolving paths, seeking with channels or streams, and file management
>>> operations are language independent. And I am pretty sure there are Python
>>> libraries for that.
>>> 
>>> However, I am happy to hear thoughts and get help from people working on
>>> the Python SDK.
>>> 
>>> 
>>>> On Mon, Dec 5, 2016 at 4:41 PM, Pei He <pe...@google.com.invalid> wrote:
>>>> 
>>>>> I have received a lot of comments on "Part 1: IOChannelFactory
>>>>> Redesign" [1], and I have updated the design based on the feedback.
>>>>> 
>>>>> Now I feel it is close to being ready for implementation, and I would
>>>>> like to summarize the changes:
>>>>> 1. Replaced FilePath with URI for resolving file paths.
>>>>> 2. Required match(String spec) to handle ambiguities in user-provided
>>>>> strings (see the match() Javadoc in the design doc for details).
>>>>> 3. Changed Metadata to use the Future.get() paradigm, and removed
>>>>> exception().
>>>>> 4. Changed methods on the FileSystem interface to be protected (visible
>>>>> for implementors), and created the FileSystems utility (visible for callers).
>>>>> 5. Simplified the FileSystem interface by moving operation options, such
>>>>> as DeleteOptions and MatchOptions, to the FileSystems utility.
>>>>> 6. Simplified the FileSystem interface by requiring certain behaviors,
>>>>> such as creating recursively and throwing for missing files.
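Change 4 in the list above (protected provider methods plus a caller-facing utility) could look roughly like the following. All names here (SketchFileSystem, InMemoryFileSystem, SketchFileSystems, the "mem" scheme, and the exists() operation) are hypothetical and chosen for illustration only; the actual interface is defined in the design doc. The sketch assumes the utility dispatches on the URI scheme and lives in the same package as the provider base class, which is what lets it call the protected methods in Java.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Provider-facing base class: protected methods are visible to implementors
// and to code in the same package, but not to callers in other packages.
abstract class SketchFileSystem {
    protected abstract String scheme();
    protected abstract boolean exists(URI uri);
}

// A toy provider that keeps a set of known URIs in memory.
final class InMemoryFileSystem extends SketchFileSystem {
    private final Set<URI> files = new HashSet<>();

    void add(URI uri) {
        files.add(uri);
    }

    @Override
    protected String scheme() {
        return "mem";
    }

    @Override
    protected boolean exists(URI uri) {
        return files.contains(uri);
    }
}

// Caller-facing utility: picks a provider by URI scheme and delegates to it.
final class SketchFileSystems {
    private static final Map<String, SketchFileSystem> REGISTRY = new HashMap<>();

    static void register(SketchFileSystem fileSystem) {
        REGISTRY.put(fileSystem.scheme(), fileSystem);
    }

    static boolean exists(URI uri) {
        SketchFileSystem fileSystem = REGISTRY.get(uri.getScheme());
        if (fileSystem == null) {
            throw new IllegalArgumentException(
                "No file system registered for scheme: " + uri.getScheme());
        }
        return fileSystem.exists(uri);
    }

    public static void main(String[] args) {
        InMemoryFileSystem mem = new InMemoryFileSystem();
        mem.add(URI.create("mem:///tmp/a.txt"));
        register(mem);
        System.out.println(exists(URI.create("mem:///tmp/a.txt")));
    }
}
```

Keeping options and convenience methods in the utility means a provider only has to implement the small protected surface.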
>>>>> 
>>>>> Any thoughts / feedback?
>>>>> --
>>>>> Pei
>>>>> 
>>>>> [1]
>>>>> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#
>>>>> 
>>>>>> On Wed, Nov 30, 2016 at 1:32 PM, Pei He <pe...@google.com> wrote:
>>>>>> 
>>>>>> Thanks JB for the feedback.
>>>>>> 
>>>>>> Yes, we should provide a hadoop.fs.FileSystem adaptor. As you said, it
>>>>>> will make a range of file systems available in Beam.
>>>>>> 
>>>>>> And people can choose to implement BeamFileSystem directly to get the
>>>>>> best performance (for example, by providing bulk operations).
>>>>>> 
>>>>>> --
>>>>>> Pei
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Nov 29, 2016 at 11:11 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Pei,
>>>>>>> 
>>>>>>> Thinking about this again, I understand that the purpose of the Beam
>>>>>>> filesystem is to avoid bringing a bunch of dependencies into the core.
>>>>>>> That makes perfect sense.
>>>>>>> 
>>>>>>> So I agree that a Beam filesystem abstraction is fine.
>>>>>>> 
>>>>>>> My point is that we should provide a HadoopFilesystem extension/plugin
>>>>>>> for the Beam filesystem asap: that would help us support a good range of
>>>>>>> filesystems quickly.
>>>>>>> 
>>>>>>> Just my $0.01 ;)
>>>>>>> 
>>>>>>> Regards
>>>>>>> JB
>>>>>>> 
>>>>>>> 
>>>>>>>> On 11/17/2016 08:18 PM, Pei He wrote:
>>>>>>>> 
>>>>>>>> Hi JB,
>>>>>>>> My proposals are based on the current IOChannelFactory and how it is
>>>>>>>> used in FileBasedSink.
>>>>>>>> 
>>>>>>>> Let me spend more time investigating the Hadoop FileSystem interface.
>>>>>>>> --
>>>>>>>> Pei
>>>>>>>> 
>>>>>>>> On Thu, Nov 17, 2016 at 1:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> By the way, Pei, for the record: why introduce BeamFileSystem rather
>>>>>>>>> than use the Hadoop FileSystem interface?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>> 
>>>>>>>>> On 11/17/2016 01:09 AM, Pei He wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I am working on BEAM-59
>>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
>>>>>>>>>> redesign". The goals are:
>>>>>>>>>> 
>>>>>>>>>> 1. Support file-based IOs (TextIO, AvroIO) with user-defined file
>>>>>>>>>> systems.
>>>>>>>>>> 
>>>>>>>>>> 2. Support configuring any user-defined file system.
>>>>>>>>>> 
>>>>>>>>>> I drafted the design proposal in two parts to address them in
>>>>>>>>>> order:
>>>>>>>>>> 
>>>>>>>>>> Part 1: IOChannelFactory Redesign
>>>>>>>>>> <https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#>
>>>>>>>>>> 
>>>>>>>>>> Summary:
>>>>>>>>>> 
>>>>>>>>>> Old API: WritableByteChannel create(String spec, String mimeType);
>>>>>>>>>> 
>>>>>>>>>> New API: WritableByteChannel create(URI uri, CreateOptions options);
>>>>>>>>>> 
>>>>>>>>>> Noticeable proposed changes:
>>>>>>>>>> 
>>>>>>>>>>   1. Include the options parameter in most methods to specify behaviors.
>>>>>>>>>>   2. Replace String with URI to include the scheme for file/directory
>>>>>>>>>>      locations.
>>>>>>>>>>   3. Require file systems to provide a SeekableByteChannel for reads.
>>>>>>>>>>   4. Add methods such as getMetadata(), rename(), etc.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Part 2: Configurable BeamFileSystem
>>>>>>>>>> <https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>
>>>>>>>>>> 
>>>>>>>>>> Summary:
>>>>>>>>>> 
>>>>>>>>>> Old API: IOChannelUtils.getFactory(glob).match(glob);
>>>>>>>>>> 
>>>>>>>>>> New API: BeamFileSystems.getFileSystem(glob, config).match(glob);
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Looking for comments and feedback.
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> 
>>>>>>>>>> Pei
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>> jbono...@apache.org
>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>> Talend - http://www.talend.com
