[ 
https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934324#comment-16934324
 ] 

Kostas Kloudas commented on FLINK-9749:
---------------------------------------

I see [~dangdangdang]. As far as state is concerned, the book-keeping of the 
files (which offsets are valid, which windows are open, when to close them, 
etc.) has to be done by Flink, as Flink is the one that has to revert invalid 
data in case of failures. So some of the state will have to be kept in Flink. 
In addition, I think that 1) performance will not be good, as this would be 
like the filesystem state backend offered by Flink but without the in-memory 
part, and 2) the complexity we would have to introduce to the sink, and the 
resulting maintenance burden, would outweigh the benefits it brings. 
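
To make the book-keeping concrete, here is a minimal sketch of the per-bucket 
state such a sink has to checkpoint in Flink. The class and field names are 
hypothetical, not the actual sink internals:

{code:java}
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the per-bucket book-keeping (illustrative only,
// not the real sink internals). Flink checkpoints this; on recovery the
// sink truncates the in-progress file back to validOffset and re-commits
// pendingFiles, which is why this state must live in Flink.
public class BucketState implements Serializable {
    String bucketId;            // e.g. the date/hour path the bucket covers
    String inProgressFile;      // the file currently being written
    long validOffset;           // bytes known to be good as of the last checkpoint
    List<String> pendingFiles = new ArrayList<>(); // closed files awaiting checkpoint completion
}
{code}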

In general, I find that having two different ways of doing the same thing is 
not a good idea, especially from a maintenance point of view.

If your main goal is state handling, then I would recommend having a look at 
the ongoing discussion about a spillable state backend here: 
[https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend].

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / FileSystem
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
>
> The BucketingSink has a series of deficits at the moment.
> Due to the long list of issues, I would suggest adding a new 
> StreamingFileSink with a new and cleaner design.
> h3. Encoders, Parquet, ORC
>  - It only efficiently supports row-wise data formats (Avro, JSON, sequence 
> files); see the sketch after this list.
>  - Efforts to add (columnar) compression for blocks of data are inefficient, 
> because blocks cannot span checkpoints due to persistence-on-checkpoint.
>  - The encoders are part of the {{flink-connector-filesystem}} project, 
> rather than in orthogonal format projects. This blows up the dependencies of 
> the {{flink-connector-filesystem}} project. As an example, the rolling file 
> sink has dependencies on Hadoop and Avro, which messes up dependency 
> management.
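>
> For illustration, here is a minimal sketch of how a reworked sink could keep 
> row-wise encoders and bulk (Parquet) writers orthogonal, using the 
> StreamingFileSink builders; the paths and the Avro schema are placeholders:
> {code:java}
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.flink.api.common.serialization.SimpleStringEncoder;
> import org.apache.flink.core.fs.Path;
> import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
> import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
>
> // Row-wise format: records are encoded one by one, so in-progress files
> // can be persisted at arbitrary offsets on each checkpoint.
> StreamingFileSink<String> rowSink = StreamingFileSink
>     .forRowFormat(new Path("s3://my-bucket/rows"), new SimpleStringEncoder<String>("UTF-8"))
>     .build();
>
> // Bulk format: the columnar writer owns the file layout, so part files
> // are rolled on every checkpoint instead of being persisted mid-block.
> Schema schema = new Schema.Parser().parse(
>     "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":[{\"name\":\"f\",\"type\":\"string\"}]}");
> StreamingFileSink<GenericRecord> bulkSink = StreamingFileSink
>     .forBulkFormat(new Path("s3://my-bucket/parquet"), ParquetAvroWriters.forGenericRecord(schema))
>     .build();
> {code}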
> h3. Use of FileSystems
>  - The BucketingSink works only against Hadoop's FileSystem abstraction, 
> does not support Flink's own FileSystem abstraction, and cannot work with 
> the packaged S3, maprfs, and swift file systems (see the sketch after this 
> list).
>  - The sink hence needs Hadoop as a dependency
>  - The sink relies on "trying out" whether truncation works, which requires 
> write access to the users working directory
>  - The sink relies on enumerating and counting files, rather than 
> maintaining its own state, making it less efficient.
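>
> For reference, a minimal sketch of writing through Flink's own FileSystem 
> abstraction, which resolves the packaged s3://, maprfs://, and swift:// 
> schemes by URI without pulling in Hadoop (the path is a placeholder):
> {code:java}
> import java.nio.charset.StandardCharsets;
> import org.apache.flink.core.fs.FSDataOutputStream;
> import org.apache.flink.core.fs.FileSystem;
> import org.apache.flink.core.fs.Path;
>
> Path path = new Path("s3://my-bucket/part-0"); // placeholder path
> FileSystem fs = path.getFileSystem();          // picks the implementation by URI scheme
> try (FSDataOutputStream out = fs.create(path, FileSystem.WriteMode.NO_OVERWRITE)) {
>     out.write("some record".getBytes(StandardCharsets.UTF_8));
> }
> {code}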
> h3. Correctness and Efficiency on S3
>  - The BucketingSink relies on strong consistency in the file enumeration, 
> hence it may work incorrectly on S3.
>  - The BucketingSink relies on persisting streams at intermediate points. 
> This does not work properly on S3, hence there may be data loss (see the 
> sketch below).
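>
> For context, a replacement sink can build on a persist/resume contract 
> instead of persisting raw streams mid-way. A minimal sketch using Flink's 
> RecoverableWriter abstraction; the path and variable names are placeholders:
> {code:java}
> import java.nio.charset.StandardCharsets;
> import org.apache.flink.core.fs.Path;
> import org.apache.flink.core.fs.RecoverableFsDataOutputStream;
> import org.apache.flink.core.fs.RecoverableWriter;
>
> Path path = new Path("s3://my-bucket/part-0"); // placeholder path
> RecoverableWriter writer = path.getFileSystem().createRecoverableWriter();
>
> RecoverableFsDataOutputStream out = writer.open(path);
> out.write("record".getBytes(StandardCharsets.UTF_8));
>
> // On checkpoint: take a durable handle rather than relying on the
> // stream itself being persisted at an intermediate point.
> RecoverableWriter.ResumeRecoverable resumable = out.persist();
>
> // On recovery: resume exactly at the persisted position; anything
> // written after the handle was taken is discarded.
> RecoverableFsDataOutputStream restored = writer.recover(resumable);
> {code}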
> h3. .valid-length companion file
>  - The valid-length file makes it hard for consumers of the data and should 
> be dropped (see the sketch below).
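>
> To illustrate the burden on consumers, a sketch of the extra step every 
> reader needs today. The file names are placeholders, and it assumes the 
> companion file stores the valid byte count as a decimal string:
> {code:java}
> import java.io.InputStream;
> import java.nio.charset.StandardCharsets;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> Path part = Paths.get("/data/out/part-0-0");                       // placeholder
> Path validLength = Paths.get("/data/out/_part-0-0.valid-length");  // placeholder
>
> // Without the companion file, the whole part file is valid.
> long valid = Files.exists(validLength)
>     ? Long.parseLong(new String(Files.readAllBytes(validLength), StandardCharsets.UTF_8).trim())
>     : Files.size(part);
>
> // Every consumer must stop at the valid length, because the tail of the
> // file may contain data that was written but never committed.
> try (InputStream in = Files.newInputStream(part)) {
>     byte[] good = in.readNBytes((int) valid);
> }
> {code}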
> We track this design in a series of sub issues.


