One design decision made during the previous design discussion [1] is
"Replacing FilePath with URI for resolving file paths". I brought it back to
the dev@ mailing list in my previous email.

Comment [2] asked me to clarify the impact on Windows users, because they
would have to specify paths in URI format, such as:
"file:///C:/home/input-*"
"C:/home/"

The reason for using URIs in the API is to keep Beam code file-system agnostic.

The alternative is Java Path/File. It is used in the current
IOChannelFactory API, and it works poorly. For example, on Windows, Path
throws when the path contains a file scheme or an asterisk:
new File("file:///C:/home/").toPath() throws in toPath().
Paths.get("C:/home/").resolve("output-*") throws in resolve().
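
For comparison, here is a minimal sketch of how java.net.URI treats the same
strings. The class name UriPathSketch and the helper resolve() are made up
for illustration and are not part of the proposed API.

```java
import java.net.URI;

// Hypothetical demo class, not part of Beam or of the design doc.
class UriPathSketch {
    // Resolve a file name (which may contain a glob '*') against a directory URI.
    public static URI resolve(String directory, String name) {
        return URI.create(directory).resolve(name);
    }

    public static void main(String[] args) {
        // A file URI with a Windows drive letter and a glob parses without throwing.
        URI input = URI.create("file:///C:/home/input-*");
        System.out.println(input.getScheme() + " " + input.getPath());

        // Resolving a glob against a directory URI does not throw either.
        URI output = resolve("file:///C:/home/", "output-*");
        System.out.println(output.getPath());

        // Caveat: a bare "C:/home/" is also syntactically a URI, but
        // java.net.URI reads the drive letter "C" as the scheme.
        URI bare = URI.create("C:/home/");
        System.out.println(bare.getScheme() + " " + bare.getPath());
    }
}
```

The last case shows why a bare Windows path is ambiguous when read as a URI;
I expect match(String spec) to be the place where such user-provided strings
get disambiguated.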

Any thoughts and suggestions are welcome.

Thanks
--
Pei

---
[1]:
https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA30vtPU#heading=h.p3gc3colc2cs

[2]:
https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA02O1cY

On Tue, Dec 6, 2016 at 1:25 PM, Kenneth Knowles <k...@google.com.invalid>
wrote:

> Thanks for the thorough answers. It all sounds good to me.
>
> On Tue, Dec 6, 2016 at 12:57 PM, Pei He <pe...@google.com.invalid> wrote:
>
> > Thanks Kenn for the feedback and questions.
> >
> > I responded inline.
> >
> > On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles <k...@google.com.invalid>
> > wrote:
> >
> > > I really like this document. It is easy to read and informative. Three
> > > things not addressed by the document:
> > >
> > > 1. Major Beam use cases. I'm sure we have a few in the SDK that could be
> > > outlined in terms of the new API with pseudocode.
> >
> >
> > (I am writing pseudocode directly against the FileSystem interface to
> > demonstrate. However, clients will use the FileSystems utility. This gives
> > us a layer between the file system providers' interface and the client
> > interface. We can add utility functions to FileSystems for common use
> > patterns as needed.)
> >
> > Major Beam use cases are the following:
> > A. FileBasedSource:
> > // a. Get input URIs and file sizes from user-provided specs.
> > // Note: I updated match() to be a bulk operation after I sent my last
> > // email.
> > List<MatchResult> results = match(specList);
> > List<Metadata> inputMetadataList = FluentIterable.from(results)
> >     .transformAndConcat(
> >         new Function<MatchResult, Iterable<Metadata>>() {
> >           @Override
> >           public Iterable<Metadata> apply(MatchResult result) {
> >             return Arrays.asList(result.metadata());
> >           }
> >         })
> >     .toList();
> >
> > // b. Read from a start offset to support the source splitting.
> > SeekableByteChannel seekChannel = open(fileUri);
> > seekChannel.position(source.getStartOffset());
> > seekChannel.read(...);
> >
> > B. FileBasedSink:
> > // bulk rename temporary files to output files
> > rename(tempUris, outputUris);
> >
> > C. General file operations:
> > a. resolve paths
> > b. create file to write, open file to read (for example in tests).
> > c. bulk delete files/directories
> >
> >
> >
> > > 2. Related work. How does this differ from other filesystem APIs and
> > > why?
> >
> > We need three sets of functionality:
> > 1. resolve paths.
> > 2. read and write channels.
> > 3. bulk file management operations (bulk delete/rename/match).
> >
> > They are available from Java nio, the hadoop FileSystem APIs, and other
> > standard libraries such as java.net.URI.
> >
> > The current IOChannelFactory interface uses Java nio for (1) and (2), and
> > defines its own interface for (3).
> >
> > In my redesign, I made the following choices:
> > For (1), I replaced Java nio with URI, because it is standardized and
> > precise and doesn't require additional implementation of a Path interface
> > from file system providers.
> >
> > For (2), I kept the use of Java nio (Writable/SeekableByteChannel), since
> > I don't see anything that needs to improve and I don't see any better
> > alternatives (hadoop's FSDataInput/OutputStream provide the same
> > functionality, but require additional dependencies).
> >
> > For (3), the reasons I didn't choose Java nio or hadoop are:
> > 1. Beam needs a bulk operations API for better performance, whereas the
> > Java nio and hadoop FileSystem APIs are single-file based.
> > 2. Have APIs that are file-system agnostic. For example, we can use URI
> > instead of Path.
> > 3. Have APIs that are minimal and easy for file system providers to
> > implement.
> > 4. Introduce fewer dependencies.
> > 5. It is easy to build an adaptor based on the Java nio or hadoop
> > interfaces.
> >
> > > 3. Discussion of non-Java languages. It would be good to know what
> > > classes in e.g. Python we might use in place of URI,
> > > SeekableByteChannel, etc.
> >
> > I don't want to mislead people here without a thorough investigation. As
> > you can see from your second question, that would require iterations on
> > design and prototyping.
> >
> > I didn't introduce any Java-specific requirements in the redesign.
> > Resolving paths, seeking with channels or streams, and file management
> > operations are language independent. And I am pretty sure there are Python
> > libraries for that.
> >
> > However, I am happy to hear thoughts and get help from people working on
> > the Python SDK.
> >
> >
> > > On Mon, Dec 5, 2016 at 4:41 PM, Pei He <pe...@google.com.invalid> wrote:
> > >
> > > > I have received a lot of comments on "Part 1: IOChannelFactory
> > > > Redesign" [1], and I have updated the design based on the feedback.
> > > >
> > > > Now I feel it is close to being ready for implementation, and I would
> > > > like to summarize the changes:
> > > > 1. Replaced FilePath with URI for resolving file paths.
> > > > 2. Required match(String spec) to handle ambiguities in user-provided
> > > > strings (see the match() javadoc in the design doc for details).
> > > > 3. Changed Metadata to use the Future.get() paradigm, and removed
> > > > exception().
> > > > 4. Changed methods on the FileSystem interface to be protected (visible
> > > > to implementors), and created the FileSystems utility (visible to
> > > > callers).
> > > > 5. Simplified the FileSystem interface by moving operation options, such
> > > > as DeleteOptions and MatchOptions, to the FileSystems utility.
> > > > 6. Simplified the FileSystem interface by requiring certain behaviors,
> > > > such as creating recursively and throwing for missing files.
> > > >
> > > > Any thoughts / feedback?
> > > > --
> > > > Pei
> > > >
> > > > [1]
> > > > https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#
> > > >
> > > > On Wed, Nov 30, 2016 at 1:32 PM, Pei He <pe...@google.com> wrote:
> > > >
> > > > > Thanks JB for the feedback.
> > > > >
> > > > > Yes, we should provide a hadoop.fs.FileSystem adaptor. As you said, it
> > > > > will make a range of file systems available in Beam.
> > > > >
> > > > > And people can choose to implement BeamFileSystem directly to get the
> > > > > best performance (for example, by providing bulk operations).
> > > > >
> > > > > --
> > > > > Pei
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Nov 29, 2016 at 11:11 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > > > > wrote:
> > > > >
> > > > >> Hi Pei,
> > > > >>
> > > > > >> Thinking about this again, I understand that the purpose of the Beam
> > > > > >> filesystem is to avoid bringing a bunch of dependencies into the core.
> > > > > >> That makes perfect sense.
> > > > > >>
> > > > > >> So, I agree that a Beam filesystem abstraction is fine.
> > > > > >>
> > > > > >> My point is that we should provide a HadoopFilesystem extension/plugin
> > > > > >> for the Beam filesystem asap: that would help us support a good range
> > > > > >> of filesystems quickly.
> > > > >>
> > > > >> Just my $0.01 ;)
> > > > >>
> > > > >> Regards
> > > > >> JB
> > > > >>
> > > > >>
> > > > >> On 11/17/2016 08:18 PM, Pei He wrote:
> > > > >>
> > > > >>> Hi JB,
> > > > > >>> My proposals are based on the current IOChannelFactory, and how it is
> > > > > >>> used in FileBasedSink.
> > > > > >>>
> > > > > >>> Let me spend more time investigating the Hadoop FileSystem interface.
> > > > >>> --
> > > > >>> Pei
> > > > >>>
> > > > > >>> On Thu, Nov 17, 2016 at 1:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>>> By the way, Pei, for the record: why introduce BeamFileSystem and not
> > > > > >>>> use the Hadoop FileSystem interface?
> > > > >>>>
> > > > >>>> Thanks
> > > > >>>> Regards
> > > > >>>> JB
> > > > >>>>
> > > > >>>> On 11/17/2016 01:09 AM, Pei He wrote:
> > > > >>>>
> > > > > >>>>> Hi,
> > > > > >>>>>
> > > > > >>>>> I am working on BEAM-59
> > > > > >>>>> <https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
> > > > > >>>>> redesign". The goals are:
> > > > > >>>>>
> > > > > >>>>> 1. Support file-based IOs (TextIO, AvroIO) with user-defined file
> > > > > >>>>> systems.
> > > > >>>>>
> > > > >>>>> 2. Support configuring any user-defined file system.
> > > > >>>>>
> > > > > >>>>> And I drafted the design proposal in two parts to address them in
> > > > > >>>>> order:
> > > > > >>>>>
> > > > > >>>>> Part 1: IOChannelFactory Redesign
> > > > > >>>>> <https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#>
> > > > > >>>>>
> > > > > >>>>> Summary:
> > > > > >>>>>
> > > > > >>>>> Old API: WritableByteChannel create(String spec, String mimeType);
> > > > > >>>>>
> > > > > >>>>> New API: WritableByteChannel create(URI uri, CreateOptions options);
> > > > >>>>>
> > > > > >>>>> Notable proposed changes:
> > > > > >>>>>
> > > > > >>>>>    1. Include the options parameter in most methods to specify behaviors.
> > > > > >>>>>    2. Replace String with URI to include the scheme for file/directory
> > > > > >>>>>       locations.
> > > > > >>>>>    3. Require file systems to provide a SeekableByteChannel for read.
> > > > > >>>>>    4. Add methods such as getMetadata(), rename(), etc.
> > > > >>>>>
> > > > >>>>>
> > > > > >>>>> Part 2: Configurable BeamFileSystem
> > > > > >>>>> <https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>
> > > > > >>>>>
> > > > > >>>>> Summary:
> > > > > >>>>>
> > > > > >>>>> Old API: IOChannelUtils.getFactory(glob).match(glob);
> > > > > >>>>>
> > > > > >>>>> New API: BeamFileSystems.getFileSystem(glob, config).match(glob);
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> Looking for comments and feedback.
> > > > >>>>>
> > > > >>>>> Thanks
> > > > >>>>>
> > > > >>>>> --
> > > > >>>>>
> > > > >>>>> Pei
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> --
> > > > >>>> Jean-Baptiste Onofré
> > > > >>>> jbono...@apache.org
> > > > >>>> http://blog.nanthrax.net
> > > > >>>> Talend - http://www.talend.com
> > > > >>>>
> > > > >>>>
> > > > >>>
> > > > >> --
> > > > >> Jean-Baptiste Onofré
> > > > >> jbono...@apache.org
> > > > >> http://blog.nanthrax.net
> > > > >> Talend - http://www.talend.com
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
