Thanks Kenn for the feedback and questions.

I responded inline.

On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles <k...@google.com.invalid>
wrote:

> I really like this document. It is easy to read and informative. Three
> things not addressed by the document:
>
> 1. Major Beam use cases. I'm sure we have a few in the SDK that could be
> outlined in terms of the new API with pseudocode.


(I am writing the pseudocode directly against the FileSystem interface to
demonstrate. Clients, however, will use the FileSystems utility, which gives
us a layer between the file system providers' interface and the client
interface. We can add utility functions to FileSystems for common usage
patterns as needed.)

The major Beam use cases are the following:
A. FileBasedSource:
// a. Get input URIs and file sizes from user-provided specs.
// Note: I updated match() to be a bulk operation after I sent my last
// email.
List<MatchResult> results = match(specList);
List<Metadata> inputMetadataList = FluentIterable.from(results)
    .transformAndConcat(
        new Function<MatchResult, Iterable<Metadata>>() {
          @Override
          public Iterable<Metadata> apply(MatchResult result) {
            return Arrays.asList(result.metadata());
          }
        })
    .toList();

// b. Read from a start offset to support the source splitting.
SeekableByteChannel seekChannel = open(fileUri);
seekChannel.position(source.getStartOffset());
seekChannel.read(...);
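
To make step (b) concrete, here is a minimal sketch on a local file using only
java.nio. It is an illustration under assumptions, not the proposed FileSystem
implementation: the class name SeekReadSketch and the helper readRange() are
hypothetical, and Files.newByteChannel() stands in for open(fileUri).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class, for illustration only.
class SeekReadSketch {
  /** Reads up to {@code length} bytes starting at {@code startOffset}. */
  static byte[] readRange(Path file, long startOffset, int length) throws IOException {
    try (SeekableByteChannel channel = Files.newByteChannel(file, StandardOpenOption.READ)) {
      // Jump to the start offset, as a split source would.
      channel.position(startOffset);
      ByteBuffer buffer = ByteBuffer.allocate(length);
      // read() may return fewer bytes than requested, so loop until the
      // buffer is full or the end of the file is reached.
      while (buffer.hasRemaining() && channel.read(buffer) != -1) {
        // keep reading
      }
      buffer.flip();
      byte[] bytes = new byte[buffer.remaining()];
      buffer.get(bytes);
      return bytes;
    }
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("seek-sketch", ".txt");
    Files.write(file, "hello world".getBytes(StandardCharsets.UTF_8));
    // Prints "world": 5 bytes starting at offset 6.
    System.out.println(new String(readRange(file, 6, 5), StandardCharsets.UTF_8));
    Files.delete(file);
  }
}
```

A request that runs past the end of the file simply returns the bytes that
exist, which is the behavior a source reading its last split would rely on.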

B. FileBasedSink:
// bulk rename temporary files to output files
rename(tempUris, outputUris);
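
As a rough sketch of what a bulk rename could look like for local files, the
following loops over paired lists of URIs. The class BulkRenameSketch is
hypothetical and the use of java.nio.file.Files.move() is my assumption for
the local case; a remote file system could batch the same call into fewer
requests.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
class BulkRenameSketch {
  /** Renames sources.get(i) to destinations.get(i) for every i. */
  static void rename(List<URI> sources, List<URI> destinations) throws IOException {
    if (sources.size() != destinations.size()) {
      throw new IllegalArgumentException("sources and destinations must have the same size");
    }
    for (int i = 0; i < sources.size(); i++) {
      // Local files are moved one at a time; only file: URIs work here.
      Files.move(
          Paths.get(sources.get(i)),
          Paths.get(destinations.get(i)),
          StandardCopyOption.REPLACE_EXISTING);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("rename-sketch");
    Path temp0 = Files.createFile(dir.resolve("temp-0"));
    Path temp1 = Files.createFile(dir.resolve("temp-1"));
    rename(
        Arrays.asList(temp0.toUri(), temp1.toUri()),
        Arrays.asList(dir.resolve("output-0").toUri(), dir.resolve("output-1").toUri()));
    // Prints "true": the temporary file now lives under its output name.
    System.out.println(Files.exists(dir.resolve("output-0")));
  }
}
```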

C. General file operations:
a. resolve paths
b. create file to write, open file to read (for example in tests).
c. bulk delete files/directories



> 2. Related work. How does this differ from other filesystem APIs and why?

We need three sets of functionality:
1. Resolving paths.
2. Read and write channels.
3. Bulk file management operations (bulk delete/rename/match).

They are all available from Java NIO, the Hadoop FileSystem API, and
standard library classes such as java.net.URI.

The current IOChannelFactory interface uses Java NIO for (1) and (2), and
defines its own interface for (3).

In my redesign, I made the following choices:
For (1), I replaced Java NIO paths with URI, because URI is standardized and
precise, and it doesn't require file system providers to implement an
additional Path interface.
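
As a small illustration of what "precise" means here, java.net.URI resolves
child names according to the URI rules, so a trailing slash on the base
matters. The bucket name and the class UriResolveSketch below are made up for
the example; only URI.create(), resolve(), and getScheme() come from the JDK.

```java
import java.net.URI;

// Hypothetical helper class, for illustration only.
class UriResolveSketch {
  /** Resolves a child name against a base URI using standard URI resolution. */
  static URI resolve(URI base, String child) {
    return base.resolve(child);
  }

  public static void main(String[] args) {
    // "gs://my-bucket/..." is a made-up location.
    URI directory = URI.create("gs://my-bucket/output/");
    URI file = URI.create("gs://my-bucket/output");

    // Prints "gs://my-bucket/output/part-00000.txt".
    System.out.println(resolve(directory, "part-00000.txt"));
    // Prints "gs://my-bucket/part-00000.txt": without the trailing slash,
    // the last path segment is treated as a file and replaced.
    System.out.println(resolve(file, "part-00000.txt"));
    // Prints "gs": the scheme is available for picking a file system.
    System.out.println(directory.getScheme());
  }
}
```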

For (2), I kept the use of Java NIO (WritableByteChannel/SeekableByteChannel),
since I don't see anything that needs improving or any better alternatives
(Hadoop's FSDataInputStream/FSDataOutputStream provide the same
functionality, but require additional dependencies).

For (3), the reasons I didn't choose Java NIO or Hadoop are:
1. Beam needs a bulk operations API for better performance, whereas the Java
NIO and Hadoop FileSystem APIs are single-file based.
2. The APIs should be file system agnostic. For example, we can use URI
instead of Path.
3. The APIs should be minimal and easy for file system providers to
implement.
4. It introduces fewer dependencies.
5. It is easy to build an adaptor based on the Java NIO or Hadoop interfaces.
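
To illustrate point 5, here is a rough sketch of an adaptor over
java.nio.file. The interface BulkFileSystemSketch is a hypothetical,
trimmed-down stand-in and not the interface from the design doc; it only
shows that bulk-style methods can be layered on top of single-file calls.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.Collection;

/** Adapts the single-file java.nio.file API to the bulk-style interface below. */
class NioAdapterSketch implements BulkFileSystemSketch {
  @Override
  public SeekableByteChannel open(URI uri) throws IOException {
    return Files.newByteChannel(Paths.get(uri), StandardOpenOption.READ);
  }

  @Override
  public void delete(Collection<URI> uris) throws IOException {
    for (URI uri : uris) {
      // Files.delete() throws NoSuchFileException when the file is missing.
      Files.delete(Paths.get(uri));
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("adapter-sketch");
    Path a = Files.createFile(dir.resolve("a"));
    Path b = Files.createFile(dir.resolve("b"));
    new NioAdapterSketch().delete(Arrays.asList(a.toUri(), b.toUri()));
    // Prints "false": both files were removed by the single bulk call.
    System.out.println(Files.exists(a) || Files.exists(b));
  }
}

/** Hypothetical stand-in for a bulk-oriented file system interface. */
interface BulkFileSystemSketch {
  SeekableByteChannel open(URI uri) throws IOException;

  void delete(Collection<URI> uris) throws IOException;
}
```

A provider with a native batch API would implement delete() directly instead
of looping, which is where the performance benefit of the bulk interface
would come from.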

> 3. Discussion of non-Java languages. It would be good to know what classes
> in e.g. Python we might use in place of URI, SeekableByteChannel, etc.

I don't want to mislead people here without a thorough investigation. As you
can see from your second question, that would require iterations on design
and prototyping.

I didn't introduce any Java-specific requirements in the redesign.
Resolving paths, seeking with channels or streams, and file management
operations are language independent, and I am pretty sure there are Python
libraries for them.

However, I am happy to hear thoughts and get help from people working on
the Python SDK.


> On Mon, Dec 5, 2016 at 4:41 PM, Pei He <pe...@google.com.invalid> wrote:
>
> > I have received a lot of comments in "Part 1: IOChannelFactory
> > Redesign" [1]. And, I have updated the design based on the feedback.
> >
> > Now, I feel it is close to being ready for implementation, and I would
> > like to summarize the changes:
> > 1. Replaced FilePath with URI for resolving file paths.
> > 2. Required match(String spec) to handle ambiguities in user-provided
> > strings (see the match() Javadoc in the design doc for details).
> > 3. Changed Metadata to use the Future.get() paradigm, and removed
> > exception().
> > 4. Changed methods on the FileSystem interface to be protected (visible
> > for implementors), and created a FileSystems utility (visible for callers).
> > 5. Simplified the FileSystem interface by moving operation options, such
> > as DeleteOptions and MatchOptions, to the FileSystems utility.
> > 6. Simplified the FileSystem interface by requiring certain behaviors,
> > such as creating recursively and throwing for missing files.
> >
> > Any thoughts / feedback?
> > --
> > Pei
> >
> > [1]
> > https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#
> >
> > On Wed, Nov 30, 2016 at 1:32 PM, Pei He <pe...@google.com> wrote:
> >
> > > Thanks JB for the feedback.
> > >
> > > Yes, we should provide a hadoop.fs.FileSystem adaptor. As you said, it
> > > will make a range of file systems available in Beam.
> > >
> > > And, people can choose to implement BeamFileSystem directly to get the
> > > best performance (For example, providing bulk operations.)
> > >
> > > --
> > > Pei
> > >
> > >
> > >
> > > On Tue, Nov 29, 2016 at 11:11 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > > wrote:
> > >
> > >> Hi Pei,
> > >>
> > >> Thinking about that again, I understand that the purpose of the Beam
> > >> filesystem is to avoid bringing a bunch of dependencies into the core.
> > >> That makes perfect sense.
> > >>
> > >> So, I agree that a Beam filesystem abstraction is fine.
> > >>
> > >> My point is that we should provide a HadoopFilesystem extension/plugin
> > >> for the Beam filesystem asap: that would help us to support a good range
> > >> of filesystems quickly.
> > >>
> > >> Just my $0.01 ;)
> > >>
> > >> Regards
> > >> JB
> > >>
> > >>
> > >> On 11/17/2016 08:18 PM, Pei He wrote:
> > >>
> > >>> Hi JB,
> > >>> My proposals are based on the current IOChannelFactory, and how it
> > >>> is used in FileBasedSink.
> > >>>
> > >>> Let me spend more time investigating the Hadoop FileSystem interface.
> > >>> --
> > >>> Pei
> > >>>
> > >>> On Thu, Nov 17, 2016 at 1:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > >>> wrote:
> > >>>
> > >>>> By the way, Pei, for the record: why introduce BeamFileSystem and not
> > >>>> use the Hadoop FileSystem interface?
> > >>>>
> > >>>> Thanks
> > >>>> Regards
> > >>>> JB
> > >>>>
> > >>>> On 11/17/2016 01:09 AM, Pei He wrote:
> > >>>>
> > >>>> Hi,
> > >>>>>
> > >>>>> I am working on BEAM-59
> > >>>>> <https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
> > >>>>> redesign". The goals are:
> > >>>>>
> > >>>>> 1. Support file-based IOs (TextIO, AvroIO) with user-defined file
> > >>>>> systems.
> > >>>>>
> > >>>>> 2. Support configuring any user-defined file system.
> > >>>>>
> > >>>>> And, I drafted the design proposal in two parts to address them in
> > >>>>> order:
> > >>>>>
> > >>>>> Part 1: IOChannelFactory Redesign
> > >>>>> <https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#>
> > >>>>>
> > >>>>> Summary:
> > >>>>>
> > >>>>> Old API: WritableByteChannel create(String spec, String mimeType);
> > >>>>>
> > >>>>> New API: WritableByteChannel create(URI uri, CreateOptions options);
> > >>>>>
> > >>>>> Noticeable proposed changes:
> > >>>>>
> > >>>>>
> > >>>>>    1. Include the options parameter in most methods to specify
> > >>>>>       behaviors.
> > >>>>>    2. Replace String with URI to include the scheme for
> > >>>>>       file/directory locations.
> > >>>>>    3. Require file systems to provide a SeekableByteChannel for reads.
> > >>>>>    4. Add methods such as getMetadata(), rename(), etc.
> > >>>>>
> > >>>>>
> > >>>>> Part 2: Configurable BeamFileSystem
> > >>>>> <https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>
> > >>>>>
> > >>>>> Summary:
> > >>>>>
> > >>>>> Old API: IOChannelUtils.getFactory(glob).match(glob);
> > >>>>>
> > >>>>> New API: BeamFileSystems.getFileSystem(glob, config).match(glob);
> > >>>>>
> > >>>>>
> > >>>>> Looking for comments and feedback.
> > >>>>>
> > >>>>> Thanks
> > >>>>>
> > >>>>> --
> > >>>>>
> > >>>>> Pei
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>> Jean-Baptiste Onofré
> > >>>> jbono...@apache.org
> > >>>> http://blog.nanthrax.net
> > >>>> Talend - http://www.talend.com
> > >>>>
> > >>>>
> > >>>
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbono...@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com
> > >>
> > >
> > >
> >
>
