[ https://issues.apache.org/jira/browse/FLINK-24392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421630#comment-17421630 ]

Seth Wiesman commented on FLINK-24392:
--------------------------------------

Potentially as a follow-up, I think it's worth investigating after this upgrade 
whether streaming uploads should be the default. I have a hard time thinking of 
a case where you wouldn't want this behavior, but perhaps someone can come up 
with a counterexample.
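
For reference, here is roughly how the opt-in could look if streaming uploads 
stay off by default. This is only a minimal sketch, not the actual wiring: it 
assumes the Trino options hive.s3.streaming.enabled and 
hive.s3.streaming.part-size end up being forwarded through Flink's "s3." 
config prefix, which the upgrade would have to confirm.

{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileSystem;

public class EnableStreamingUploads {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Assumed mapping: "s3.streaming.enabled" -> Trino's "hive.s3.streaming.enabled".
        conf.setString("s3.streaming.enabled", "true");
        // Assumed mapping for the size of each locally buffered multipart piece.
        conf.setString("s3.streaming.part-size", "32MB");
        // Filesystems pick these settings up when they are initialized.
        FileSystem.initialize(conf, null);
    }
}
{code}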

> Upgrade presto s3 fs implementation to Trino >= 348
> ---------------------------------------------------
>
>                 Key: FLINK-24392
>                 URL: https://issues.apache.org/jira/browse/FLINK-24392
>             Project: Flink
>          Issue Type: Improvement
>          Components: FileSystems
>    Affects Versions: 1.14.0
>            Reporter: Robert Metzger
>            Priority: Major
>             Fix For: 1.15.0
>
>
> The Presto s3 filesystem implementation currently shipped with Flink doesn't 
> support streaming uploads: all data needs to be materialized to a single file 
> on disk before it can be uploaded.
> This can lead to situations where TaskManagers run out of disk space when 
> creating a savepoint.
> The Hadoop filesystem implementation supports streaming uploads (by locally 
> buffering smaller parts, say 100 MB each, and shipping them as multipart 
> uploads; see the sketch after this description), but it makes more API 
> calls, which leads to other issues.
> Trino versions >= 348 support streaming uploads.
> During experiments, I also noticed that the current presto s3 fs 
> implementation seems to allocate a lot of memory outside the heap (when 
> shipping large data, for example when creating a savepoint). On a K8s pod 
> with a memory limit of 4000Mi, I was not able to run Flink with a 
> "taskmanager.memory.flink.size" above 3000m. This means that an additional 
> 1 GB of memory needs to be reserved just for the peaks in native memory 
> allocation that occur while presto s3 writes a savepoint (a memory-budget 
> sketch follows below).
> As part of this upgrade, we also need to make sure that the new presto / 
> Trino version is not doing substantially more S3 API calls than the current 
> version. After switching from presto s3 to hadoop s3, I noticed that 
> disposing of an old checkpoint (~100 GB) can take up to 15 minutes. The 
> upgraded presto s3 fs should still be able to dispose of state quickly (a 
> batched-delete sketch follows below).
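
To make the streaming-upload behavior mentioned in the description concrete, 
here is a minimal sketch of the multipart pattern, assuming a plain AWS SDK v1 
AmazonS3 client. It is not the Presto/Trino or Hadoop code path, only the 
general technique: each ~100 MB part is shipped as soon as its buffer fills, 
so the whole object never has to sit on local disk at once.

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadResult;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class StreamingUploadSketch {

    // Buffer ~100 MB locally per part instead of materializing the whole object.
    private static final int PART_SIZE = 100 * 1024 * 1024;

    public static void upload(AmazonS3 s3, String bucket, String key, InputStream data)
            throws IOException {
        InitiateMultipartUploadResult init =
                s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key));
        List<PartETag> etags = new ArrayList<>();
        byte[] buffer = new byte[PART_SIZE];
        int partNumber = 1;
        int filled;
        // Ship each part as soon as its buffer fills; the local footprint
        // stays at one part regardless of the total object size.
        while ((filled = fill(data, buffer)) > 0) {
            UploadPartRequest part = new UploadPartRequest()
                    .withBucketName(bucket)
                    .withKey(key)
                    .withUploadId(init.getUploadId())
                    .withPartNumber(partNumber++)
                    .withInputStream(new ByteArrayInputStream(buffer, 0, filled))
                    .withPartSize(filled);
            etags.add(s3.uploadPart(part).getPartETag());
        }
        s3.completeMultipartUpload(
                new CompleteMultipartUploadRequest(bucket, key, init.getUploadId(), etags));
    }

    // Read until the buffer is full or the stream ends; returns bytes read.
    private static int fill(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        int n;
        while (total < buf.length && (n = in.read(buf, total, buf.length - total)) != -1) {
            total += n;
        }
        return total;
    }
}
{code}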
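On the memory observation: with a 4000Mi pod limit and 
taskmanager.memory.flink.size capped at 3000m, roughly 1 GB is left over for 
metaspace, JVM overhead, and native allocation spikes. A minimal sketch of 
budgeting that headroom explicitly, using standard Flink memory options (the 
exact split chosen here is an assumption, not a measured recommendation):

{code:java}
import org.apache.flink.configuration.Configuration;

public class MemoryHeadroomSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Heap + managed + network memory stays at 3000m ...
        conf.setString("taskmanager.memory.flink.size", "3000m");
        // ... while JVM overhead is sized to absorb the native allocation
        // peaks seen during savepoint uploads (assumed 768m-1g range).
        conf.setString("taskmanager.memory.jvm-overhead.min", "768m");
        conf.setString("taskmanager.memory.jvm-overhead.max", "1g");
    }
}
{code}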
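And on disposal speed: a ~100 GB checkpoint typically spans many objects, so 
disposal time is dominated by the number of DELETE requests. Below is a 
minimal sketch of the batched variant, again assuming an AWS SDK v1 client 
rather than the actual filesystem code: DeleteObjects removes up to 1000 keys 
per request, which is what keeps disposal fast compared to one DELETE per key.

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.util.ArrayList;
import java.util.List;

public class CheckpointDisposalSketch {

    // Delete all objects under a checkpoint prefix with batched DeleteObjects
    // calls. Each ListObjectsV2 page holds at most 1000 keys, matching the
    // DeleteObjects per-request limit.
    public static void disposePrefix(AmazonS3 s3, String bucket, String prefix) {
        ListObjectsV2Request listReq =
                new ListObjectsV2Request().withBucketName(bucket).withPrefix(prefix);
        ListObjectsV2Result page;
        do {
            page = s3.listObjectsV2(listReq);
            List<DeleteObjectsRequest.KeyVersion> keys = new ArrayList<>();
            for (S3ObjectSummary obj : page.getObjectSummaries()) {
                keys.add(new DeleteObjectsRequest.KeyVersion(obj.getKey()));
            }
            if (!keys.isEmpty()) {
                // One round trip removes up to 1000 objects.
                s3.deleteObjects(new DeleteObjectsRequest(bucket).withKeys(keys));
            }
            listReq.setContinuationToken(page.getNextContinuationToken());
        } while (page.isTruncated());
    }
}
{code}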



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
