[jira] [Commented] (FLINK-9749) Rework Bucketing Sink

Kostas Kloudas (Jira) Mon, 23 Sep 2019 02:55:16 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935703#comment-16935703
 ]


Kostas Kloudas commented on FLINK-9749:
---------------------------------------

Hi [~dangdangdang]! We just merged 
https://issues.apache.org/jira/browse/FLINK-13864 to master, which makes it 
possible to extend the {{StreamingFileSink}}.

Essentially, with this change we aim to allow users to implement their 
specific needs on top of the {{StreamingFileSink}} without (hopefully) 
requiring changes to Flink itself.

Please have a look and let us know whether this allows you to implement your 
use case and, if not, what is missing.

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / FileSystem
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
>
> The BucketingSink has a series of deficits at the moment.
> Due to the long list of issues, I would suggest adding a new 
> StreamingFileSink with a new and cleaner design.
> h3. Encoders, Parquet, ORC
>  - It only efficiently supports row-wise data formats (Avro, JSON, sequence 
> files).
>  - Efforts to add (columnar) compression for blocks of data are inefficient, 
> because blocks cannot span checkpoints due to persistence-on-checkpoint.
>  - The encoders are part of the {{flink-connector-filesystem}} project, 
> rather than in orthogonal formats projects. This blows up the dependencies of 
> the {{flink-connector-filesystem}} project. As an example, the 
> rolling file sink has dependencies on Hadoop and Avro, which messes up 
> dependency management.
> h3. Use of FileSystems
>  - The BucketingSink works only on Hadoop's FileSystem abstraction. It does 
> not support Flink's own FileSystem abstraction and cannot work with the 
> packaged S3, maprfs, and swift file systems.
>  - The sink hence needs Hadoop as a dependency
>  - The sink relies on "trying out" whether truncation works, which requires 
> write access to the user's working directory
>  - The sink relies on enumerating and counting files, rather than maintaining 
> its own state, making it less efficient
> h3. Correctness and Efficiency on S3
>  - The BucketingSink relies on strong consistency in the file enumeration, 
> hence may work incorrectly on S3.
>  - The BucketingSink relies on persisting streams at intermediate points. 
> This does not work properly on S3, hence there may be data loss on S3.
> h3. .valid-length companion file
>  - The valid-length file makes things hard for consumers of the data and 
> should be dropped
> We track this design in a series of sub issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
