[
https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Fokko Driesprong updated FLINK-9749:
------------------------------------
Description:
The BucketingSink has a series of deficits at the moment.
Due to the long list of issues, I would suggest adding a new StreamingFileSink
with a new, cleaner design.
h3. Encoders, Parquet, ORC
- It only efficiently supports row-wise data formats (Avro, JSON, sequence
files); a sketch of a row-wise encoder contract follows this list.
- Efforts to add (columnar) compression for blocks of data are inefficient,
because blocks cannot span checkpoints due to persistence-on-checkpoint.
- The encoders are part of the {{flink-connector-filesystem}} project, rather
than in orthogonal format projects. This blows up the dependencies of the
{{flink-connector-filesystem}} project. As an example, the rolling file
sink has dependencies on Hadoop and Avro, which messes up dependency management.
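To make the row-wise case concrete, here is a minimal sketch of the kind of format-agnostic encoder contract a new sink could expose. The names ({{RowEncoder}}, {{StringRowEncoder}}) are hypothetical, not an existing Flink API; the point is that the sink only streams bytes, and the format lives entirely behind the interface rather than in the connector's dependencies.
{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical row-wise encoder: each record is appended to the current
 * part file's stream, so the sink itself never needs a dependency on
 * Avro, Hadoop, or any other format library.
 */
public interface RowEncoder<IN> extends Serializable {
    void encode(IN element, OutputStream stream) throws IOException;
}

/** Example: a newline-delimited UTF-8 encoder for String records. */
class StringRowEncoder implements RowEncoder<String> {
    @Override
    public void encode(String element, OutputStream stream) throws IOException {
        stream.write(element.getBytes(StandardCharsets.UTF_8));
        stream.write('\n');
    }
}
{code}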
h3. Use of FileSystems
- The BucketingSink works only on Hadoop's FileSystem abstraction, does not
support Flink's own FileSystem abstraction, and therefore cannot work with the
packaged S3, maprfs, and swift file systems; a sketch of going through Flink's
abstraction instead follows this list.
- The sink hence needs Hadoop as a dependency.
- The sink relies on "trying out" whether truncation works, which requires
write access to the user's working directory.
- The sink relies on enumerating and counting files, rather than maintaining
its own state, making it less efficient.
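For contrast, a minimal sketch of resolving a path through Flink's own file system abstraction ({{org.apache.flink.core.fs.FileSystem}}), which picks the implementation by URI scheme. The bucket path is only an example, and the matching packaged file system (e.g. the S3 one) is assumed to be on the classpath.
{code:java}
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public class FlinkFsExample {
    public static void main(String[] args) throws Exception {
        // Example path; any scheme with a packaged Flink file system works
        // (s3://, maprfs://, swift://, hdfs://, file://, ...).
        Path basePath = new Path("s3://my-bucket/output");

        // Flink resolves the matching FileSystem implementation by scheme,
        // so a sink built on this would not need Hadoop's FileSystem classes.
        FileSystem fs = basePath.getFileSystem();

        System.out.println("Resolved file system: " + fs.getClass().getName());
    }
}
{code}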
h3. Correctness and Efficiency on S3
- The BucketingSink relies on strong consistency in the file enumeration and
hence may work incorrectly on S3; the sketch after this list illustrates the
fragile pattern.
- The BucketingSink relies on persisting streams at intermediate points. This
does not work properly on S3, hence there may be data loss on S3.
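To illustrate the enumeration issue, here is a simplified sketch of a list-and-count recovery pattern (hypothetical code, not the actual BucketingSink implementation): deriving state from {{listStatus}} assumes the listing reflects all completed writes, which S3's eventually consistent listing does not guarantee.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EnumerationPattern {
    // Simplified illustration: reconstructing state from a directory
    // listing assumes the listing is strongly consistent. On S3 the
    // listing may lag behind writes, so recently written part files
    // can be silently missed on recovery.
    static int countPartFiles(FileSystem fs, Path bucketPath) throws Exception {
        int count = 0;
        for (FileStatus status : fs.listStatus(bucketPath)) {
            if (status.getPath().getName().startsWith("part-")) {
                count++;
            }
        }
        return count; // may be too low on an eventually consistent store
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        System.out.println(countPartFiles(fs, new Path("/tmp/buckets")));
    }
}
{code}
Keeping the set of in-progress and finished part files in the sink's own checkpointed state avoids this listing dependency entirely.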
h3. .valid-length companion file
- The valid-length file makes it hard for consumers to read the data and
should be dropped; the sketch after this list shows the extra work it forces
on every reader.
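As a rough sketch of that burden (assuming the companion file stores the number of valid bytes as a plain decimal string; the file names are hypothetical): every consumer has to look for the companion file and truncate its reads accordingly, instead of simply reading the data file.
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ValidLengthReader {
    public static void main(String[] args) throws IOException {
        // Hypothetical file names for illustration.
        Path dataFile = Paths.get("part-0-0");
        Path validLengthFile = Paths.get("_part-0-0.valid-length");

        // Without the companion file the whole data file is readable;
        // with it, everything past validLength stems from an incomplete
        // write and must be ignored by every consumer.
        long validLength = Files.exists(validLengthFile)
                ? Long.parseLong(Files.readString(validLengthFile).trim())
                : Files.size(dataFile);

        try (InputStream in = Files.newInputStream(dataFile)) {
            byte[] buffer = new byte[8192];
            long remaining = validLength;
            int read;
            while (remaining > 0
                    && (read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining))) != -1) {
                // process exactly 'read' bytes of valid data here
                remaining -= read;
            }
        }
    }
}
{code}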
We track this design in a series of sub-issues.
> Rework Bucketing Sink
> ---------------------
>
> Key: FLINK-9749
> URL: https://issues.apache.org/jira/browse/FLINK-9749
> Project: Flink
> Issue Type: New Feature
> Components: Connectors / FileSystem
> Reporter: Stephan Ewen
> Assignee: Kostas Kloudas
> Priority: Major
>