[jira] [Commented] (FLINK-11499) Extend StreamingFileSink BulkFormats to support arbitrary roll policies

Sivaprasanna Sethuraman (Jira) Sat, 28 Mar 2020 01:21:09 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069306#comment-17069306
 ]


Sivaprasanna Sethuraman commented on FLINK-11499:
-------------------------------------------------

[~pnowojski]

Regarding point #1 in the failure/recovery scenario: are you implying that the 
main stream, in addition to the roll-up that happens, say, every hour, also has 
to be rolled (i.e. published) on every checkpoint? If so, are we not back to 
square one, ending up with very few records in each rolled file? Please correct 
me if I have misunderstood your point.

I also second [~kkl0u]'s point that this may add latency and slow down 
recovery, since we would have to reread the committed WAL files, get to the 
WIP WAL stream, and only then resume processing.

However, I think the storage bandwidth can still be managed and won't be a big 
issue. Most object stores are pretty cheap, and if we guarantee that, upon 
rolling up the main stream, we delete the committed WAL files covering the 
period that has just been rolled up, storage should not become a problem.
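
To make the cleanup idea concrete, here is a minimal sketch of what I have in 
mind. It is only an illustration: `WalCleanupSketch`, `WalFile` and 
`pruneCoveredBy` are hypothetical names, not existing Flink classes, and I am 
assuming each committed WAL file knows the time range it covers.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; none of these types exist in Flink.
final class WalCleanupSketch {

    /** A committed WAL file covering records in [startMillis, endMillis). */
    static final class WalFile {
        final String path;
        final long startMillis;
        final long endMillis;

        WalFile(String path, long startMillis, long endMillis) {
            this.path = path;
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }
    }

    /**
     * Removes from {@code committed} every WAL file that lies entirely within
     * the period already rolled up into the main stream (endMillis at or
     * before rolledUpToMillis) and returns the removed files so the caller
     * can delete them. Files that extend past the rolled-up period are kept.
     */
    static List<WalFile> pruneCoveredBy(List<WalFile> committed, long rolledUpToMillis) {
        List<WalFile> deletable = new ArrayList<>();
        Iterator<WalFile> it = committed.iterator();
        while (it.hasNext()) {
            WalFile file = it.next();
            if (file.endMillis <= rolledUpToMillis) {
                deletable.add(file);
                it.remove();
            }
        }
        return deletable;
    }

    public static void main(String[] args) {
        List<WalFile> committed = new ArrayList<>();
        committed.add(new WalFile("wal-0", 0L, 60_000L));
        committed.add(new WalFile("wal-1", 60_000L, 120_000L));
        committed.add(new WalFile("wal-2", 120_000L, 180_000L));

        // The main stream has been rolled up to the 120 s mark.
        for (WalFile file : pruneCoveredBy(committed, 120_000L)) {
            System.out.println("delete " + file.path);
        }
        for (WalFile file : committed) {
            System.out.println("keep " + file.path);
        }
    }
}
```

With this, the WAL files kept at any time are bounded by one roll-up period 
(plus whatever has not been rolled up yet), which is why I don't expect 
storage to be a concern.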

> Extend StreamingFileSink BulkFormats to support arbitrary roll policies
> -----------------------------------------------------------------------
>
>                 Key: FLINK-11499
>                 URL: https://issues.apache.org/jira/browse/FLINK-11499
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>            Reporter: Seth Wiesman
>            Priority: Major
>              Labels: usability
>             Fix For: 1.11.0
>
>
> Currently, when using the StreamingFileSink, bulk-encoding formats can only 
> be combined with the `OnCheckpointRollingPolicy`, which rolls the in-progress 
> part file on every checkpoint.
> However, many bulk formats such as Parquet are most efficient when written as 
> large files; this is not possible when frequent checkpointing is enabled. 
> Currently the only workaround is to use long checkpoint intervals, which is 
> not ideal.
>  
> The StreamingFileSink should be enhanced to support arbitrary roll policies 
> so users may write large bulk files while retaining frequent checkpoints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
