[
https://issues.apache.org/jira/browse/FLINK-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17098975#comment-17098975
]
Piotr Nowojski commented on FLINK-11499:
----------------------------------------
A small status update here.
1. I was doing a PoC in that direction and quickly realised that it would
require modifying most of the existing StreamingFileSink (SFS) classes.
SFS/Bucket/Buckets/… have hardcoded assumptions about working on a single path,
and lack support for reading or cleaning up/deleting files (a sketch of the
missing operations follows after this list).
2. There is a concurrent effort to use StreamingFileSink with Hadoop-based
file systems without our RecoverableFileSystem abstraction, which would
probably conflict with the WAL changes ([~maguowei] is taking care of it).
3. We haven’t figured out how to deal with changes to the record format, for
example on job upgrades. The current SFS has no issues with that, as there is
no in-flight data: a record is written once, and once written we can
completely forget about it. With a WAL, upon recovery we need to read such
records back, which creates a problem: what if the record schema/format has
changed? This could be dealt with in a couple of ways (either supporting some
migration/backward compatibility, or adding a requirement to completely empty
the WAL on job upgrade when using a savepoint), but either way it would be a
source of extra complexity (see the version-aware framing sketch below).
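
To illustrate point 1: a WAL would need to read files back and to delete
them, which the current SFS abstractions do not expose. The helper below is
purely hypothetical (WalFiles is not an existing Flink class); it only shows
the required file-system calls, using Flink's real FileSystem/Path API:

    import java.io.IOException;

    import org.apache.flink.core.fs.FSDataInputStream;
    import org.apache.flink.core.fs.Path;

    // Hypothetical helper, illustrating the two operations a WAL needs on top
    // of what SFS/Bucket/Buckets support today (write-only, single path).
    final class WalFiles {

        // Read a WAL file back so in-flight records can be replayed on recovery.
        static FSDataInputStream openForReplay(Path walFile) throws IOException {
            return walFile.getFileSystem().open(walFile);
        }

        // Delete a WAL file once its records have been committed downstream.
        static void discard(Path walFile) throws IOException {
            walFile.getFileSystem().delete(walFile, /* recursive */ false);
        }
    }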
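
And for point 3, a minimal, purely hypothetical sketch (none of these classes
exist in Flink) of version-aware WAL framing: each entry carries the schema
version it was written with, so recovery after an upgrade can either migrate
old entries or fail fast with a clear error:

    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Hypothetical WAL entry framing: prefix every entry with the schema
    // version it was serialized with.
    final class VersionedWalEntry {

        static byte[] write(int schemaVersion, byte[] payload) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(schemaVersion); // format version of the serialized record
            out.writeInt(payload.length); // payload length for framing
            out.write(payload);
            out.flush();
            return bytes.toByteArray();
        }

        static byte[] read(InputStream in, int expectedVersion) throws IOException {
            DataInputStream data = new DataInputStream(in);
            int version = data.readInt();
            byte[] payload = new byte[data.readInt()];
            data.readFully(payload);
            if (version != expectedVersion) {
                // Here one would either run a migration, or require the WAL to
                // be completely empty when taking the upgrade savepoint.
                throw new IOException("WAL entry has schema version " + version
                        + " but the job expects " + expectedVersion);
            }
            return payload;
        }
    }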
Because of that, we started to consider first trying another approach to the
problem: https://issues.apache.org/jira/browse/FLINK-17505 .
> Extend StreamingFileSink BulkFormats to support arbitrary roll policies
> -----------------------------------------------------------------------
>
> Key: FLINK-11499
> URL: https://issues.apache.org/jira/browse/FLINK-11499
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / FileSystem
> Reporter: Seth Wiesman
> Priority: Major
> Labels: usability
> Fix For: 1.11.0
>
>
> Currently, when using the StreamingFileSink, bulk-encoding formats can only
> be combined with the `OnCheckpointRollingPolicy`, which rolls the in-progress
> part file on every checkpoint.
> However, many bulk formats such as Parquet are most efficient when written as
> large files; this is not possible when frequent checkpointing is enabled.
> Currently the only work-around is a long checkpoint interval, which is not
> ideal (a sketch of the restriction follows this description).
>
> The StreamingFileSink should be enhanced to support arbitrary rolling
> policies so that users may write large bulk files while retaining frequent
> checkpoints.
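
To make the quoted restriction concrete, a minimal sketch of a bulk-encoded
sink as the API stood when this issue was filed; the Parquet/Avro writer
factory from flink-parquet and the output path are just illustrative choices:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;

    import org.apache.flink.core.fs.Path;
    import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    class BulkSinkExample {

        // Bulk-encoded sinks are tied to OnCheckpointRollingPolicy: the
        // builder returned by forBulkFormat accepts no other rolling policy,
        // so the in-progress part file is rolled on every checkpoint and
        // frequent checkpoints produce many small Parquet files.
        static StreamingFileSink<GenericRecord> parquetSink(Schema schema) {
            return StreamingFileSink
                    .forBulkFormat(
                            new Path("s3://bucket/output"), // placeholder path
                            ParquetAvroWriters.forGenericRecord(schema))
                    .build();
        }
    }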