Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-12-13 Thread Kenneth Knowles
I don't think there is any conflict here.

On Tue, Dec 13, 2016 at 12:34 PM, Pei He  wrote:

> One design decision made during previous design discussion [1] is
> "Replacing FilePath with URI for resolving files paths". This has been
> brought back to dev@ mailing list in my previous email.
>

The direction of this argument, in my opinion, gets the burden of proof
wrong.

The original design document effectively proposed "instead of using URIs,
let's make a Beam-specific abstraction" and [1] is just the natural comment
"let's just use URI". This works for the internet, and gives interop with
essentially all code, so you need a very special reason not to do it (and
special cases generally manifest as custom URI schemes).

> Comment [2] asked me to clarify the impact on Windows OS users because
> users have to specify the path in the URI format, such as:
> "file:///C:/home/input-*"
> "C:/home/"
>

It is not really true that users have to do this. For the command line,
converting to a URI is the responsibility of the code that parses
"--filesToStage C:\my\windows\path". Users should absolutely be able to
specify paths like this on Windows; supporting that is not difficult, and it
is nothing your proposal needs to solve.

With programmatic creation in Java code, the same principle applies: the
environment-specific String/File/Path should be converted to a URI at the
membrane. Making an API take a URI makes it completely obvious to a Java
programmer that if they have a String/File/Path they need to convert it
appropriately.
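
To make the "convert at the membrane" idea concrete, here is a rough sketch of
such a conversion. SpecToUri and toUri are hypothetical names, not Beam API or
anything from the design doc, and treating a one-letter scheme as a Windows
drive letter is an assumption made only for this illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.file.Paths;

// Hypothetical helper (not Beam API): converts a user-supplied string into a
// URI at the boundary, so the code behind it only ever sees URIs.
class SpecToUri {
    static URI toUri(String spec) {
        try {
            URI uri = new URI(spec);
            String scheme = uri.getScheme();
            // Assumption: a one-letter "scheme" such as "C" is a Windows drive
            // letter, so only longer schemes are treated as real URIs.
            if (scheme != null && scheme.length() > 1) {
                return uri;
            }
        } catch (URISyntaxException e) {
            // Not a valid URI (for example, it contains backslashes):
            // fall through and treat the string as a local path.
        }
        // Let the platform's default file system interpret the string.
        return Paths.get(spec).toUri();
    }

    public static void main(String[] args) {
        System.out.println(toUri("gs://bucket/input-*"));
        System.out.println(toUri("/tmp/data.txt"));
    }
}
```

On Windows, a glob character in a local path would still have to be split off
before calling Paths.get(), since Path rejects it there.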

Kenn


> Using URIs in the API is to ensure Beam code is file systems agnostic.
>
> Another alternative is Java Path/File. It is used in the current
> IOChannelFactory API, and it works poorly. For example, Path throws when
> there are file scheme or asterisk in the path:
> new File("file:///C:/home/").toPath() throws in toPath().
> Paths.get("C:/home/").resolve("output-*") throws in resolve().
>
> any thoughts and suggestions are welcome.
>
> Thanks
> --
> Pei
>
> ---
> [1]:
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=A30vtPU#heading=h.p3gc3colc2cs
>
> [2]:
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=A02O1cY
>

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-12-13 Thread Amir Bahmanyari
How can I unsubscribe?
I will be away from this subject for some time.
Will rejoin once I get back
Thanks colleagues
Happy holidays 

Sent from my iPhone


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-12-13 Thread Pei He
One design decision made during the previous design discussion [1] is
"Replacing FilePath with URI for resolving file paths". I brought this back
to the dev@ mailing list in my previous email.

Comment [2] asked me to clarify the impact on Windows OS users because
users have to specify the path in the URI format, such as:
"file:///C:/home/input-*"
"C:/home/"

The purpose of using URIs in the API is to keep Beam code file-system agnostic.

Another alternative is Java Path/File. It is used in the current
IOChannelFactory API, and it works poorly. For example, on Windows Path
throws when the string contains a file scheme or an asterisk:
new File("file:///C:/home/").toPath() throws in toPath().
Paths.get("C:/home/").resolve("output-*") throws in resolve().
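
For comparison, java.net.URI accepts both the file scheme and the asterisk
used in the two calls above. A minimal sketch follows; UriResolveDemo is a
made-up class name for illustration, not part of the proposal.

```java
import java.net.URI;

// Illustrative only: resolving a glob pattern against a directory URI.
class UriResolveDemo {
    // The directory URI must end with "/"; otherwise resolve() replaces its
    // last path segment instead of appending to it.
    static URI resolve(String directoryUri, String child) {
        return URI.create(directoryUri).resolve(child);
    }

    public static void main(String[] args) {
        URI out = resolve("file:///C:/home/", "output-*");
        System.out.println(out.getScheme());
        System.out.println(out.getPath());
    }
}
```

The same call behaves identically on every platform, because URI never
consults the local file system.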

Any thoughts and suggestions are welcome.

Thanks
--
Pei

---
[1]:
https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=A30vtPU#heading=h.p3gc3colc2cs

[2]:
https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=A02O1cY


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-12-06 Thread Kenneth Knowles
Thanks for the thorough answers. It all sounds good to me.


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-12-06 Thread Pei He
Thanks Kenn for the feedback and questions.

I responded inline.

On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles 
wrote:

> I really like this document. It is easy to read and informative. Three
> things not addressed by the document:
>
> 1. Major Beam use cases. I'm sure we have a few in the SDK that could be
> outlined in terms of the new API with pseudocode.


(I am writing the pseudocode directly against the FileSystem interface to
demonstrate. However, clients will use the FileSystems utility. This gives us
a layer between the file system providers' interface and the client
interface. We can add utility functions to FileSystems for common use
patterns as needed.)

The major Beam use cases are the following:
A. FileBasedSource:
// a. Get input URIs and file sizes from user-provided specs.
// Note: I updated match() to be a bulk operation after I sent my last
// email.
List<MatchResult> results = match(specList);
List<Metadata> inputMetadataList = FluentIterable.from(results)
    .transformAndConcat(
        new Function<MatchResult, Iterable<Metadata>>() {
          @Override
          public Iterable<Metadata> apply(MatchResult result) {
            return Arrays.asList(result.metadata());
          }
        })
    .toList();

// b. Read from a start offset to support the source splitting.
SeekableByteChannel seekChannel = open(fileUri);
seekChannel.position(source.getStartOffset());
seekChannel.read(...);

B. FileBasedSink:
// bulk rename temporary files to output files
rename(tempUris, outputUris);

C. General file operations:
a. resolve paths
b. create file to write, open file to read (for example in tests).
c. bulk delete files/directories
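
For local files, the bulk operations in B and C could look roughly like the
sketch below. LocalBulkOps and its method signatures are hypothetical, not the
interface from the design doc. It simply loops over java.nio.file.Files, while
a provider for a remote store would presumably batch the calls.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical local-file sketch of bulk rename and bulk delete.
class LocalBulkOps {
    // Renames srcUris.get(i) to destUris.get(i); the lists must match in size.
    static void rename(List<URI> srcUris, List<URI> destUris) throws IOException {
        if (srcUris.size() != destUris.size()) {
            throw new IllegalArgumentException("source and destination counts differ");
        }
        for (int i = 0; i < srcUris.size(); i++) {
            Files.move(Paths.get(srcUris.get(i)), Paths.get(destUris.get(i)),
                StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Deletes every listed file; throws NoSuchFileException for a missing one.
    static void delete(Collection<URI> uris) throws IOException {
        for (URI uri : uris) {
            Files.delete(Paths.get(uri));
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("bulk-ops");
        Path temp = Files.write(dir.resolve("temp-0"), "data".getBytes());
        Path output = dir.resolve("output-0");
        rename(Collections.singletonList(temp.toUri()),
            Collections.singletonList(output.toUri()));
        System.out.println(Files.exists(output));
        delete(Collections.singletonList(output.toUri()));
        Files.delete(dir);
    }
}
```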



> 2. Related work. How does this differ from other filesystem APIs and why?

We need three sets of functionality:
1. resolve paths.
2. read and write channels.
3. bulk file management operations (bulk delete/rename/match).

They are available from Java NIO, the Hadoop FileSystem API, and other
standard libraries such as java.net.URI.

The current IOChannelFactory interface uses Java NIO for (1) and (2), and
defines its own interface for (3).

In my redesign, I made the following choices:
For (1), I replaced Java nio with URI, because it is standardized and
precise and doesn't require additional implementation of a Path interface
from file system providers.

For (2), I kept the use of Java NIO (Writable/SeekableByteChannel), since
I don't see anything that needs improving and I don't see any better
alternatives (Hadoop's FSDataInput/OutputStream provide the same
functionality, but require additional dependencies).
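
As a small illustration of why the NIO channels are sufficient for (2), the
sketch below writes through a WritableByteChannel and then reads from a start
offset through a SeekableByteChannel, as a source split would. It uses only
the JDK; ChannelOffsetRead and readFrom are names made up for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChannelOffsetRead {
    // Reads up to `length` bytes starting at `startOffset`.
    static String readFrom(Path file, long startOffset, int length) throws IOException {
        try (SeekableByteChannel channel = Files.newByteChannel(file, StandardOpenOption.READ)) {
            channel.position(startOffset);
            ByteBuffer buffer = ByteBuffer.allocate(length);
            // read() may return fewer bytes than requested, so loop until the
            // buffer is full or the end of the file is reached.
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // keep reading
            }
            buffer.flip();
            return StandardCharsets.UTF_8.decode(buffer).toString();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("channel-demo", ".txt");
        try (WritableByteChannel out = Files.newByteChannel(file, StandardOpenOption.WRITE)) {
            out.write(ByteBuffer.wrap("header|record".getBytes(StandardCharsets.UTF_8)));
        }
        System.out.println(readFrom(file, 7, 6)); // prints "record"
        Files.delete(file);
    }
}
```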

For (3), the reasons that I didn't choose Java NIO or Hadoop are:
1. Beam needs a bulk operations API for better performance, but the Java NIO
and Hadoop FileSystem APIs operate on one file at a time.
2. The API should be file-system agnostic. For example, we can use URI
instead of Path.
3. The API should be minimal and easy for file system providers to
implement.
4. It should introduce fewer dependencies.
5. It is easy to build an adaptor based on the Java NIO or Hadoop interfaces.
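
To give a feel for point 5, here is a rough sketch of a bulk match() built on
Java NIO. Everything in it is hypothetical: NioMatchAdaptor, its Metadata
class, and the signature are stand-ins and not the design doc's API. As a
simplification, it takes a directory plus glob patterns instead of URI specs.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NioMatchAdaptor {
    // Hypothetical stand-in for the design's Metadata: a URI plus its size.
    static final class Metadata {
        final URI uri;
        final long sizeBytes;

        Metadata(URI uri, long sizeBytes) {
            this.uri = uri;
            this.sizeBytes = sizeBytes;
        }
    }

    // Bulk match: returns one result list per glob, in the order given.
    static List<List<Metadata>> match(Path directory, List<String> globs) throws IOException {
        List<List<Metadata>> results = new ArrayList<>();
        for (String glob : globs) {
            List<Metadata> matched = new ArrayList<>();
            // NIO handles one pattern at a time; the adaptor supplies the loop.
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(directory, glob)) {
                for (Path path : stream) {
                    matched.add(new Metadata(path.toUri(), Files.size(path)));
                }
            }
            results.add(matched);
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("match-demo");
        Files.write(dir.resolve("input-0"), new byte[3]);
        Files.write(dir.resolve("input-1"), new byte[5]);
        for (Metadata m : match(dir, Arrays.asList("input-*")).get(0)) {
            System.out.println(m.uri + " " + m.sizeBytes);
        }
    }
}
```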

> 3. Discussion of non-Java languages. It would be good to know what classes
> in e.g. Python we might use in place of URI, SeekableByteChannel, etc.

I don't want to mislead people here without a thorough investigation. As
your second question shows, that would require iterations on design
and prototyping.

I didn't introduce any Java-specific requirements in the redesign.
Resolving paths, seeking with channels or streams, and file management
operations are language independent. And I am pretty sure there are Python
libraries for that.

However, I am happy to hear thoughts and get help from people working on
the Python SDK.


> On Mon, Dec 5, 2016 at 4:41 PM, Pei He  wrote:
>
> > I have received a lot of comments in "Part 1: IOChannelFactory
> > Redesign" [1]. And, I have updated the design based on the feedback.
> >
> > Now, I feel it is close to be ready for implementation, and I would like
> > to summarize the changes:
> > 1. Replaced FilePath with URI for resolving files paths.
> > 2. Required match(String spec) to handle ambiguities in users provided
> > strings (see the match() java doc in the design doc for details).
> > 3. Changed Metadata to use Future.get() paradigm, and removed
> > exception().
> > 4. Changed methods on FileSystem interface to be protected (visible for
> > implementors), and created FileSystems utility (visible for callers).
> > 5. Simplified FileSystem interface by moving operation options, such as
> > DeleteOptions, MatchOptions, to the FileSystems utility.
> > 6. Simplified FileSystem interface by requiring certain behaviors, such
> > as creating recursively, throwing for missing files.
> >
> > Any thoughts / feedback?
> > --
> > Pei
> >
> > [1]
> > https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#
> >

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-11-30 Thread Pei He
Thanks JB for the feedback.

Yes, we should provide a hadoop.fs.FileSystem adaptor. As you said, it will
make a range of file systems available in Beam.

And, people can choose to implement BeamFileSystem directly to get the best
performance (for example, by providing bulk operations).

--
Pei





Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-11-29 Thread Jean-Baptiste Onofré

Hi Pei,

Rethinking this, I understand that the purpose of the Beam
filesystem is to avoid bringing a bunch of dependencies into the core.
That makes perfect sense.

So, I agree that a Beam filesystem abstraction is fine.

My point is that we should provide a HadoopFilesystem extension/plugin
for the Beam filesystem ASAP: that would help us support a good range of
filesystems quickly.


Just my $0.01 ;)

Regards
JB

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-11-17 Thread Pei He
Hi JB,
My proposals are based on the current IOChannelFactory and how it is
used in FileBasedSink.

Let me spend more time investigating the Hadoop FileSystem interface.
--
Pei



Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-11-17 Thread Jean-Baptiste Onofré
By the way, Pei, for the record: why introduce BeamFileSystem instead of
using the Hadoop FileSystem interface?


Thanks
Regards
JB

On 11/17/2016 01:09 AM, Pei He wrote:

> Hi,
>
> I am working on BEAM-59 "IOChannelFactory redesign". The goals are:
>
> 1. Support file-based IOs (TextIO, AvroIO) with user-defined file systems.
>
> 2. Support configuring any user-defined file system.
>
> And, I drafted the design proposal in two parts to address them in order:
>
> Part 1: IOChannelFactory Redesign
> <https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#>
>
> Summary:
>
> Old API: WritableByteChannel create(String spec, String mimeType);
>
> New API: WritableByteChannel create(URI uri, CreateOptions options);
>
> Noticeable proposed changes:
>
> 1. Include the options parameter in most methods to specify behaviors.
> 2. Replace String with URI to include the scheme for file/directory
>    locations.
> 3. Require file systems to provide a SeekableByteChannel for reads.
> 4. Add methods such as getMetadata(), rename(), etc.
>
> Part 2: Configurable BeamFileSystem
>
> Summary:
>
> Old API: IOChannelUtils.getFactory(glob).match(glob);
>
> New API: BeamFileSystems.getFileSystem(glob, config).match(glob);
>
> Looking for comments and feedback.
>
> Thanks
>
> --
>
> Pei

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com