[
https://issues.apache.org/jira/browse/BEAM-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989748#comment-15989748
]
ASF GitHub Bot commented on BEAM-59:
------------------------------------
GitHub user dhalperi opened a pull request:
https://github.com/apache/beam/pull/2779
[BEAM-59] Convert WriteFiles/FileBasedSink from IOChannelFactory to
FileSystems
This converts FileBasedSink from IOChannelFactory to FileSystems, with
fallout changes on all existing Transforms that use WriteFiles.
We preserve the existing semantics of most transforms, simply adding the
ability for users to provide ResourceId in addition to String when
setting the outputPrefix.
Other changes:
* Make DefaultFilenamePolicy its own top-level class and move
IOChannelUtils#constructName into it. This is the default FilenamePolicy
used by FileBasedSink.
* Rethink FilenamePolicy as a function from ResourceId (base directory)
to ResourceId (output file), moving the base directory into the
context. This way, FilenamePolicy logic is truly independent from the
base directory. Using ResourceId#resolve, a filename policy can add
multiple path components, say, base/YYYY/MM/DD/file.txt, in a
filesystem-independent way.
(Also add an optional extension parameter to the function, enabling an
owning transform to pass in the suffix from a separately-configured
compression factory or similar.)
* Remove some old logic, dating back to the Cloud Dataflow SDKs, that
disallowed certain specific filename patterns for no-longer-used
implementations.
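
To make the "function from base directory to output file" idea above concrete, here is a minimal sketch. It is not the code in this PR: it uses java.nio.file.Path as a stand-in for ResourceId, and the FilenamePolicy interface shape, the Context class, and the DATED policy are hypothetical names invented for illustration. The only point it demonstrates is that the policy never stores the base directory and builds base/YYYY/MM/DD/file.txt purely through resolve calls.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.util.Locale;

public class FilenamePolicySketch {

    /** Hypothetical per-file context; the real fields in the PR may differ. */
    static final class Context {
        final int shard;
        final int numShards;
        final LocalDate date;

        Context(int shard, int numShards, LocalDate date) {
            this.shard = shard;
            this.numShards = numShards;
            this.date = date;
        }
    }

    /**
     * Hypothetical policy shape: (base directory, context, extension) -> output file.
     * Path stands in for ResourceId here (an assumption, not Beam's API).
     */
    interface FilenamePolicy {
        Path filenameFor(Path baseDirectory, Context context, String extension);
    }

    /** Example policy: adds YYYY/MM/DD components, then a sharded file name. */
    static final FilenamePolicy DATED = (base, ctx, extension) -> {
        String file = String.format(
            Locale.ROOT, "file-%05d-of-%05d%s", ctx.shard, ctx.numShards, extension);
        // Each resolve call appends one path component, so the policy never
        // concatenates separator characters itself.
        return base
            .resolve(String.format(Locale.ROOT, "%04d", ctx.date.getYear()))
            .resolve(String.format(Locale.ROOT, "%02d", ctx.date.getMonthValue()))
            .resolve(String.format(Locale.ROOT, "%02d", ctx.date.getDayOfMonth()))
            .resolve(file);
    };

    public static void main(String[] args) {
        Context ctx = new Context(3, 10, LocalDate.of(2017, 4, 25));
        // The same policy object works unchanged for any base directory.
        System.out.println(DATED.filenameFor(Paths.get("base"), ctx, ".txt"));
        System.out.println(DATED.filenameFor(Paths.get("other", "root"), ctx, ".txt.gz"));
    }
}
```

Passing the extension as an argument mirrors the optional extension parameter mentioned above: the owning transform, not the policy, decides whether the suffix is ".txt" or something supplied by a compression setting.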
----
TODO:
- [ ] I cleaned up TextIO and AvroIO, but XmlIO and TFRecordIO need more work.
- [ ] Review test coverage.
- [ ] REALLY review testing and javadoc.
But getting this out to be able to look at the comprehensive diff.
CC: @davorbonaci @lukecwik @vikkyrk @jkff @reuvenlax
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dhalperi/beam convert-file-based-sink
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/beam/pull/2779.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2779
----
commit 1897a8756069237836a745ddaf38e9a0692db186
Author: Dan Halperin <[email protected]>
Date: 2017-04-25T17:10:28Z
Convert WriteFiles/FileBasedSink from IOChannelFactory to FileSystems
----
> Switch from IOChannelFactory to FileSystems
> -------------------------------------------
>
> Key: BEAM-59
> URL: https://issues.apache.org/jira/browse/BEAM-59
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-core, sdk-java-gcp
> Reporter: Daniel Halperin
> Assignee: Daniel Halperin
> Fix For: First stable release
>
>
> Right now, FileBasedSource and FileBasedSink communication is mediated by
> IOChannelFactory. There are a number of issues:
> * Global configuration -- e.g., all 'gs://' URIs use the same credentials.
> This should be per-source/per-sink/etc.
> * Supported APIs -- currently IOChannelFactory is in the "non-public API"
> util package and subject to change. We need users to be able to add new
> backends ('s3://', 'hdfs://', etc.) directly, without fear that they will be
> broken.
> * Per-backend features: e.g., creating buckets in GCS/s3, setting expiration
> time, etc.
> Updates:
> Design docs posted on dev@ list:
> Part 1: IOChannelFactory Redesign:
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#
> Part 2: Configurable BeamFileSystem:
> https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs