[ https://issues.apache.org/jira/browse/FLINK-24392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421630#comment-17421630 ]
Seth Wiesman commented on FLINK-24392:
--------------------------------------

Potentially, as a follow-up, I think it's worth investigating after this upgrade whether streaming uploads should be the default. I have a hard time thinking of a case in which you wouldn't want this behavior, but perhaps someone can come up with a counterexample.

> Upgrade presto s3 fs implementation to Trino >= 348
> ---------------------------------------------------
>
>                 Key: FLINK-24392
>                 URL: https://issues.apache.org/jira/browse/FLINK-24392
>             Project: Flink
>          Issue Type: Improvement
>          Components: FileSystems
>    Affects Versions: 1.14.0
>            Reporter: Robert Metzger
>            Priority: Major
>             Fix For: 1.15.0
>
>
> The Presto s3 filesystem implementation currently shipped with Flink doesn't support streaming uploads: all data must be materialized into a single file on local disk before it can be uploaded. This can lead to situations where TaskManagers run out of disk space while creating a savepoint.
>
> The Hadoop filesystem implementation supports streaming uploads (it buffers smaller parts, say 100 MB each, on local disk and ships them via S3 multipart uploads; a minimal sketch of that pattern is at the end of this description), but it issues more API calls, which leads to other issues.
>
> Trino versions >= 348 support streaming uploads (a configuration sketch follows below).
>
> During experiments, I also noticed that the current presto s3 fs implementation seems to allocate a lot of memory outside the heap when shipping large amounts of data, for example while creating a savepoint. On a K8s pod with a memory limit of 4000Mi, I was not able to run Flink with a "taskmanager.memory.flink.size" above 3000m. This means that an additional 1 GB of memory has to be reserved just for the peaks in memory allocation that occur while presto s3 is taking a savepoint (see the memory configuration sketch below).
>
> As part of this upgrade, we also need to make sure that the new presto / Trino version does not issue substantially more S3 API calls than the current version. After switching from presto s3 to hadoop s3, I noticed that disposing of an old checkpoint (~100 GB) can take up to 15 minutes. The upgraded presto s3 fs should still be able to dispose of state quickly (see the batched-delete sketch at the very end).
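>
> To make the multipart pattern above concrete, here is a minimal sketch that uses the AWS SDK for Java v1 directly. It is an illustration of the technique, not the actual Hadoop s3a code (which adds buffering strategies, retries and concurrent part uploads); the bucket, key and the 100 MB part size are placeholders:
>
>     import com.amazonaws.services.s3.AmazonS3;
>     import com.amazonaws.services.s3.AmazonS3ClientBuilder;
>     import com.amazonaws.services.s3.model.*;
>
>     import java.io.ByteArrayInputStream;
>     import java.io.IOException;
>     import java.io.InputStream;
>     import java.util.ArrayList;
>     import java.util.List;
>
>     public class StreamingUploadSketch {
>
>         // Uploads a stream of unknown length without ever materializing
>         // more than one ~100 MB part locally.
>         public static void upload(InputStream data, String bucket, String key) throws IOException {
>             AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
>             String uploadId = s3.initiateMultipartUpload(
>                     new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
>             List<PartETag> etags = new ArrayList<>();
>             byte[] buf = new byte[100 * 1024 * 1024]; // one part buffer, not the whole file
>             int len, partNumber = 1;
>             while ((len = fill(data, buf)) > 0) {
>                 // Each filled part is shipped immediately; the buffer is then reused.
>                 etags.add(s3.uploadPart(new UploadPartRequest()
>                         .withBucketName(bucket).withKey(key)
>                         .withUploadId(uploadId).withPartNumber(partNumber++)
>                         .withInputStream(new ByteArrayInputStream(buf, 0, len))
>                         .withPartSize(len)).getPartETag());
>             }
>             s3.completeMultipartUpload(
>                     new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
>         }
>
>         // S3 requires every part except the last to be >= 5 MB, so fill
>         // each buffer completely before uploading it.
>         private static int fill(InputStream in, byte[] buf) throws IOException {
>             int off = 0, n;
>             while (off < buf.length && (n = in.read(buf, off, buf.length - off)) > 0) {
>                 off += n;
>             }
>             return off;
>         }
>     }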
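>
> As for enabling the new behavior: Trino's hive connector exposes hive.s3.streaming.enabled and hive.s3.streaming.part-size for this. Assuming Flink keeps mirroring s3.* keys from flink-conf.yaml into the presto configuration, enabling streaming uploads could look like the following (the key names here are an assumption and need to be confirmed against the upgraded implementation):
>
>     # flink-conf.yaml (hypothetical keys, forwarded to hive.s3.*)
>     s3.streaming.enabled: true
>     s3.streaming.part-size: 32MB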
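>
> The 1 GB gap from the experiment above can be expressed in the TaskManager memory model by capping Flink's own memory well below the container limit. A sketch with the numbers from above (the exact split between flink.size, metaspace and JVM overhead has to be tuned per setup):
>
>     # K8s pod memory limit: 4000Mi
>     taskmanager.memory.flink.size: 3000m
>     # widen the JVM overhead budget so that the remaining ~1 GB stays available
>     # for the native (off-heap) allocations seen while presto s3 writes a savepoint
>     taskmanager.memory.jvm-overhead.fraction: 0.2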
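>
> On the disposal side, the difference between "fast" and "15 minutes" is largely whether deletes are issued per object or in bulk: S3's multi-object delete removes up to 1000 keys per call, instead of one call per file. A minimal sketch of the batched variant (the bucket name is a placeholder, and enumerating the checkpoint's keys is left out):
>
>     import com.amazonaws.services.s3.AmazonS3;
>     import com.amazonaws.services.s3.AmazonS3ClientBuilder;
>     import com.amazonaws.services.s3.model.DeleteObjectsRequest;
>
>     import java.util.List;
>
>     public class DisposalSketch {
>
>         // One DeleteObjects call per 1000 checkpoint files instead of
>         // one DeleteObject call per file.
>         public static void dispose(String bucket, List<String> checkpointKeys) {
>             AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
>             for (int i = 0; i < checkpointKeys.size(); i += 1000) {
>                 List<String> batch =
>                         checkpointKeys.subList(i, Math.min(i + 1000, checkpointKeys.size()));
>                 s3.deleteObjects(new DeleteObjectsRequest(bucket)
>                         .withKeys(batch.toArray(new String[0])));
>             }
>         }
>     }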