GitHub user dhalperi opened a pull request:
https://github.com/apache/beam/pull/2779
[BEAM-59] Convert WriteFiles/FileBasedSink from IOChannelFactory to
FileSystems
This converts FileBasedSink from IOChannelFactory to FileSystems, with
fallout changes on all existing Transforms that use WriteFiles.
We preserve the existing semantics of most transforms, simply adding the
ability for users to provide ResourceId in addition to String when
setting the outputPrefix.
Other changes:
* Make DefaultFilenamePolicy its own top-level class and move
IOChannelUtils#constructName into it. This is the default FilenamePolicy
used by FileBasedSink.
* Rethink FilenamePolicy as a function from ResourceId (base directory)
to ResourceId (output file), moving the base directory into the
context. This way, FilenamePolicy logic is truly independent from the
base directory. Using ResourceId#resolve, a filename policy can add
multiple path components, say, base/YYYY/MM/DD/file.txt, in a
filesystem-independent way.
(Also add an optional extension parameter to the function, enabling an
owning transform to pass in the suffix from a separately-configured
compression factory or similar.)
* Remove old logic that disallowed certain specific filename patterns.
It dates back to the Cloud Dataflow SDKs and applied to implementations
that are no longer used.
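To illustrate the FilenamePolicy idea above, here is a minimal standalone sketch, not the code in this PR. It deliberately avoids the Beam classes: SketchResource is a hypothetical stand-in for ResourceId that models only resolution of child directories and files, and DatedFilenamePolicySketch, its filenameFor signature, and the file-SSSSS-of-NNNNN naming are assumptions made for the example. The point is that the policy never concatenates path strings; it only resolves components against whatever base directory it is handed.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for ResourceId: models only directory/file resolution. */
final class SketchResource {
  private final String scheme;            // e.g. "gs" or "hdfs"
  private final List<String> components;  // path components, without separators
  private final boolean isDirectory;

  private SketchResource(String scheme, List<String> components, boolean isDirectory) {
    this.scheme = scheme;
    this.components = Collections.unmodifiableList(components);
    this.isDirectory = isDirectory;
  }

  /** Creates a directory resource such as gs://bucket/base/. */
  static SketchResource directory(String scheme, String... components) {
    return new SketchResource(scheme, new ArrayList<>(Arrays.asList(components)), true);
  }

  /** Resolves a child directory; only valid when this resource is a directory. */
  SketchResource resolveDirectory(String name) {
    return resolve(name, true);
  }

  /** Resolves a child file; only valid when this resource is a directory. */
  SketchResource resolveFile(String name) {
    return resolve(name, false);
  }

  private SketchResource resolve(String name, boolean childIsDirectory) {
    if (!isDirectory) {
      throw new IllegalStateException("Cannot resolve against a file: " + this);
    }
    List<String> child = new ArrayList<>(components);
    child.add(name);
    return new SketchResource(scheme, child, childIsDirectory);
  }

  @Override
  public String toString() {
    return scheme + "://" + String.join("/", components) + (isDirectory ? "/" : "");
  }
}

/** Hypothetical policy: maps a base directory to base/YYYY/MM/DD/file-SSSSS-of-NNNNN<ext>. */
final class DatedFilenamePolicySketch {
  static SketchResource filenameFor(
      SketchResource baseDirectory, LocalDate date, int shard, int numShards, String extension) {
    // The extension is optional; the owning transform may pass null.
    String suffix = (extension == null) ? "" : extension;
    String fileName =
        String.format(Locale.ROOT, "file-%05d-of-%05d%s", shard, numShards, suffix);
    // Add one path component at a time instead of joining strings with "/".
    return baseDirectory
        .resolveDirectory(String.format(Locale.ROOT, "%04d", date.getYear()))
        .resolveDirectory(String.format(Locale.ROOT, "%02d", date.getMonthValue()))
        .resolveDirectory(String.format(Locale.ROOT, "%02d", date.getDayOfMonth()))
        .resolveFile(fileName);
  }
}
```

Because the base directory is an input rather than something baked into the policy, the same policy can be reused unchanged for a base on a different filesystem; only the SketchResource passed in changes.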
----
TODO:
- [ ] I cleaned up TextIO and AvroIO, but XmlIO and TFRecordIO need more.
- [ ] Review test coverage.
- [ ] REALLY review testing and javadoc.
But getting this out to be able to look at the comprehensive diff.
CC: @davorbonaci @lukecwik @vikkyrk @jkff @reuvenlax
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dhalperi/beam convert-file-based-sink
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/beam/pull/2779.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2779
----
commit 1897a8756069237836a745ddaf38e9a0692db186
Author: Dan Halperin <[email protected]>
Date: 2017-04-25T17:10:28Z
Convert WriteFiles/FileBasedSink from IOChannelFactory to FileSystems
----