[
https://issues.apache.org/jira/browse/FLINK-24392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Metzger updated FLINK-24392:
-----------------------------------
Description:
The Presto S3 filesystem implementation currently shipped with Flink doesn't
support streaming uploads. All data needs to be materialized into a single file
on disk before it can be uploaded.
This can lead to situations where TaskManagers run out of disk space when
creating a savepoint.
The Hadoop filesystem implementation supports streaming uploads (by locally
buffering smaller parts, say 100 MB each, and shipping them via S3 multipart
uploads), but it makes more API calls, leading to other issues.
Trino version >= 348 supports streaming uploads.
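To make the difference concrete, here is a minimal, hedged sketch of the
multipart-upload pattern described above, written against the AWS SDK for Java
v1. It only illustrates the mechanism, it is not the actual Hadoop or Flink
implementation; the bucket, key, and the 100 MB part size are placeholder
assumptions:
{code:java}
// Illustration only: stream data to S3 as buffered ~100 MB parts instead of
// materializing the whole upload as one large file on disk first.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class MultipartUploadSketch {

    // Placeholder part size; only the last part may be smaller than S3's 5 MB minimum.
    private static final int PART_SIZE = 100 * 1024 * 1024;

    public static void upload(InputStream data, String bucket, String key) throws IOException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

        List<PartETag> partETags = new ArrayList<>();
        byte[] buffer = new byte[PART_SIZE];
        int partNumber = 1;
        int filled;
        // Fill one local buffer and ship it as soon as it is full, so at most
        // a single part needs to be held locally at any point in time.
        while ((filled = readFully(data, buffer)) > 0) {
            partETags.add(s3.uploadPart(new UploadPartRequest()
                    .withBucketName(bucket)
                    .withKey(key)
                    .withUploadId(uploadId)
                    .withPartNumber(partNumber++)
                    .withInputStream(new ByteArrayInputStream(buffer, 0, filled))
                    .withPartSize(filled)).getPartETag());
        }
        s3.completeMultipartUpload(
                new CompleteMultipartUploadRequest(bucket, key, uploadId, partETags));
    }

    private static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }
}
{code}
With this pattern, at most one buffered part has to be held locally at a time,
instead of materializing the whole upload as a single file on disk before the
upload can start.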
During experiments, I also noticed that the current Presto S3 fs implementation
seems to allocate a lot of memory outside the heap (when shipping large amounts
of data, for example when creating a savepoint). On a K8s pod with a memory
limit of 4000Mi, I was not able to run Flink with a
"taskmanager.memory.flink.size" above 3000m. This means that an additional 1 GB
of memory needs to be allocated just for the allocation peaks that occur while
the Presto S3 fs is taking a savepoint.
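For reference, a sketch of the memory settings from that experiment; the two
values are taken from the observation above, while the exact pod spec layout is
an assumption:
{code}
# Kubernetes container memory limit used in the experiment (spec layout assumed):
#   resources:
#     limits:
#       memory: 4000Mi

# Highest stable value in flink-conf.yaml with the current Presto S3 fs:
taskmanager.memory.flink.size: 3000m
{code}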
As part of this upgrade, we also need to make sure that the new Presto / Trino
version does not make substantially more S3 API calls than the current version.
After switching from the Presto S3 fs to the Hadoop S3 fs, I noticed that
disposing an old checkpoint (~100 GB) can take up to 15 minutes. The upgraded
Presto S3 fs should still be able to dispose state quickly.
was:
The Presto S3 filesystem implementation currently shipped with Flink doesn't
support streaming uploads. All data needs to be materialized into a single file
on disk before it can be uploaded.
This can lead to situations where TaskManagers run out of disk space when
creating a savepoint.
The Hadoop filesystem implementation supports streaming uploads (by locally
buffering smaller parts, say 100 MB each, and shipping them via S3 multipart
uploads), but it makes more API calls, leading to other issues.
Trino 348 supports streaming uploads.
> Upgrade presto s3 fs implementation to Trino >= 348
> ---------------------------------------------------
>
> Key: FLINK-24392
> URL: https://issues.apache.org/jira/browse/FLINK-24392
> Project: Flink
> Issue Type: Improvement
> Components: FileSystems
> Affects Versions: 1.14.0
> Reporter: Robert Metzger
> Priority: Major
> Fix For: 1.15.0
>
>
> The Presto S3 filesystem implementation currently shipped with Flink doesn't
> support streaming uploads. All data needs to be materialized into a single file
> on disk before it can be uploaded.
> This can lead to situations where TaskManagers run out of disk space when
> creating a savepoint.
> The Hadoop filesystem implementation supports streaming uploads (by locally
> buffering smaller parts, say 100 MB each, and shipping them via S3 multipart
> uploads), but it makes more API calls, leading to other issues.
> Trino version >= 348 supports streaming uploads.
> During experiments, I also noticed that the current Presto S3 fs
> implementation seems to allocate a lot of memory outside the heap (when
> shipping large amounts of data, for example when creating a savepoint). On a
> K8s pod with a memory limit of 4000Mi, I was not able to run Flink with a
> "taskmanager.memory.flink.size" above 3000m. This means that an additional
> 1 GB of memory needs to be allocated just for the allocation peaks that occur
> while the Presto S3 fs is taking a savepoint.
> As part of this upgrade, we also need to make sure that the new Presto /
> Trino version does not make substantially more S3 API calls than the current
> version. After switching from the Presto S3 fs to the Hadoop S3 fs, I noticed
> that disposing an old checkpoint (~100 GB) can take up to 15 minutes. The
> upgraded Presto S3 fs should still be able to dispose state quickly.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)