[ 
https://issues.apache.org/jira/browse/BEAM-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989748#comment-15989748
 ] 

ASF GitHub Bot commented on BEAM-59:
------------------------------------

GitHub user dhalperi opened a pull request:

    https://github.com/apache/beam/pull/2779

    [BEAM-59] Convert WriteFiles/FileBasedSink from IOChannelFactory to FileSystems

    This converts FileBasedSink from IOChannelFactory to FileSystems, with
    fallout changes on all existing Transforms that use WriteFiles.
    
    We preserve the existing semantics of most transforms, simply adding the
    ability for users to provide ResourceId in addition to String when
    setting the outputPrefix.
    
    Other changes:
    
    * Make DefaultFilenamePolicy its own top-level class and move
      IOChannelUtils#constructName into it. This is the default
      FilenamePolicy used by FileBasedSink.
    
    * Rethink FilenamePolicy as a function from ResourceId (base directory)
      to ResourceId (output file), moving the base directory into the
      context. This way, FilenamePolicy logic is truly independent from the
      base directory. Using ResourceId#resolve, a filename policy can add
      multiple path components, say, base/YYYY/MM/DD/file.txt, in a
      filesystem-independent way.
    
      (Also add an optional extension parameter to the function, enabling an
      owning transform to pass in the suffix from a separately-configured
      compression factory or similar.)
    
    * Remove some old logic disallowing certain specific patterns of
      filenames that dates back to Cloud Dataflow SDKs on no-longer-used
      implementations.
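
A rough sketch of the "function from ResourceId to ResourceId" idea in the FilenamePolicy bullet above. It is not the code in this pull request: java.nio.file.Path stands in for ResourceId so that the example runs without Beam on the classpath, and the FilenamePolicy interface, the datedPolicy helper, and the shard naming pattern are all hypothetical names chosen for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;

public class FilenamePolicySketch {

    /** Hypothetical stand-in for the FilenamePolicy described above; not Beam's actual interface. */
    interface FilenamePolicy {
        // baseDirectory plays the role of the ResourceId supplied through the context.
        // extension may be null when the owning transform supplies no suffix.
        Path outputFile(Path baseDirectory, int shard, int numShards, String extension);
    }

    /** Adds YYYY/MM/DD components one resolve() call at a time, never by joining separators by hand. */
    static FilenamePolicy datedPolicy(LocalDate date, String filePrefix) {
        return (baseDirectory, shard, numShards, extension) -> {
            // Illustrative shard naming only; the real template lives in DefaultFilenamePolicy.
            String name = String.format("%s-%05d-of-%05d", filePrefix, shard, numShards);
            if (extension != null) {
                name += extension;
            }
            return baseDirectory
                .resolve(String.format("%04d", date.getYear()))
                .resolve(String.format("%02d", date.getMonthValue()))
                .resolve(String.format("%02d", date.getDayOfMonth()))
                .resolve(name);
        };
    }

    public static void main(String[] args) {
        FilenamePolicy policy = datedPolicy(LocalDate.of(2017, 4, 25), "file");
        // The policy never inspects the base directory, so any base works.
        System.out.println(policy.outputFile(Paths.get("base"), 0, 3, ".txt"));
    }
}
```

Because the policy only ever calls resolve on the base it is handed, the same policy can be reused against a different base directory without any string manipulation of path separators.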
    
    ----
    
    TODO:
    
    - [ ] I cleaned up TextIO and AvroIO, but XmlIO and TFRecordIO need more.
    - [ ] Review test coverage.
    - [ ] REALLY review testing and javadoc.
    
    Getting this out now so that reviewers can look at the comprehensive diff.
    
    CC: @davorbonaci @lukecwik @vikkyrk @jkff @reuvenlax 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dhalperi/beam convert-file-based-sink

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2779.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2779
    
----
commit 1897a8756069237836a745ddaf38e9a0692db186
Author: Dan Halperin <[email protected]>
Date:   2017-04-25T17:10:28Z

    Convert WriteFiles/FileBasedSink from IOChannelFactory to FileSystems
    
    This converts FileBasedSink from IOChannelFactory to FileSystems, with
    fallout changes on all existing Transforms that use WriteFiles.
    
    We preserve the existing semantics of most transforms, simply adding the
    ability for users to provide ResourceId in addition to String when
    setting the outputPrefix.
    
    Other changes:
    
    * Make DefaultFilenamePolicy its own top-level class and move
      IOChannelUtils#constructName into it. This is the default
      FilenamePolicy used by FileBasedSink.
    
    * Rethink FilenamePolicy as a function from ResourceId (base directory)
      to ResourceId (output file), moving the base directory into the
      context. This way, FilenamePolicy logic is truly independent from the
      base directory. Using ResourceId#resolve, a filename policy can add
      multiple path components, say, base/YYYY/MM/DD/file.txt, in a
      filesystem-independent way.
    
      (Also add an optional extension parameter to the function, enabling an
      owning transform to pass in the suffix from a separately-configured
      compression factory or similar.)
    
    * Remove some old logic disallowing certain specific patterns of
      filenames that dates back to Cloud Dataflow SDKs on no-longer-used
      implementations.

----


> Switch from IOChannelFactory to FileSystems
> -------------------------------------------
>
>                 Key: BEAM-59
>                 URL: https://issues.apache.org/jira/browse/BEAM-59
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core, sdk-java-gcp
>            Reporter: Daniel Halperin
>            Assignee: Daniel Halperin
>             Fix For: First stable release
>
>
> Right now, FileBasedSource and FileBasedSink communication is mediated by
> IOChannelFactory. There are a number of issues:
> * Global configuration -- e.g., all 'gs://' URIs use the same credentials.
>   This should be per-source/per-sink/etc.
> * Supported APIs -- currently IOChannelFactory is in the "non-public API"
>   util package and subject to change. We need users to be able to add new
>   backends ('s3://', 'hdfs://', etc.) directly, without fear that they will
>   be broken.
> * Per-backend features: e.g., creating buckets in GCS/s3, setting expiration
>   time, etc.
>
> Updates:
> Design docs posted on dev@ list:
> Part 1: IOChannelFactory Redesign: 
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#
> Part 2: Configurable BeamFileSystem:
> https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
