Piotr Nowojski created FLINK-35739:
--------------------------------------
Summary: FLIP-444: Native file copy support
Key: FLINK-35739
URL: https://issues.apache.org/jira/browse/FLINK-35739
Project: Flink
Issue Type: New Feature
Components: Connectors / FileSystem
Reporter: Piotr Nowojski
https://cwiki.apache.org/confluence/display/FLINK/FLIP-444%3A+Native+file+copy+support
State downloading in Flink can be a time- and CPU-consuming operation, which is
especially visible if CPU resources per task slot are strictly restricted, for
example to a single CPU. Downloading 1GB of state can take a significant amount
of time, and the code doing so is quite inefficient.
Currently, when downloading state files, Flink creates an FSDataInputStream
from the remote file and copies its bytes to an OutputStream pointing to a
local file (in the RocksDBStateDownloader#downloadDataForStateHandle method).
Internally, FSDataInputStream is wrapped in many layers of abstraction and
indirection and, what's worse, every file is copied individually, which leads
to quite high overhead for small files. Download times and the CPU efficiency
of the download process could be significantly improved if we introduced an
API that allows org.apache.flink.core.fs.FileSystem to copy many files natively
and all at once.
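
To make the contrast concrete, here is a minimal sketch of the two shapes
described above. It is not Flink code and not the API proposed in the FLIP:
NativeCopySketch, CopyRequest, BatchCopier, StreamCopier and DelegatingCopier
are hypothetical names, and local files stand in for remote ones so the example
runs with the JDK alone. The point is only that the caller hands over the whole
batch, leaving the file system free to pick a native transfer mechanism.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; none of these names are part of Flink. */
final class NativeCopySketch {

    /** Hypothetical source/destination pair for a single file. */
    static final class CopyRequest {
        final Path source;
        final Path destination;

        CopyRequest(Path source, Path destination) {
            this.source = source;
            this.destination = destination;
        }
    }

    /** Hypothetical hook: the caller passes the whole batch in one call. */
    interface BatchCopier {
        void copyFiles(List<CopyRequest> requests) throws IOException;
    }

    /** Baseline: one stream pair per file, bytes pumped through a JVM buffer. */
    static final class StreamCopier implements BatchCopier {
        @Override
        public void copyFiles(List<CopyRequest> requests) throws IOException {
            byte[] buffer = new byte[64 * 1024];
            for (CopyRequest request : requests) {
                try (InputStream in = Files.newInputStream(request.source);
                     OutputStream out = Files.newOutputStream(request.destination)) {
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
            }
        }
    }

    /** Stand-in for a native implementation: the copy is delegated to the platform. */
    static final class DelegatingCopier implements BatchCopier {
        @Override
        public void copyFiles(List<CopyRequest> requests) throws IOException {
            for (CopyRequest request : requests) {
                Files.copy(request.source, request.destination,
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("native-copy-sketch");
        List<CopyRequest> requests = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            Path source = dir.resolve("state-" + i + ".sst");
            Files.write(source, ("payload-" + i).getBytes(StandardCharsets.UTF_8));
            requests.add(new CopyRequest(source, dir.resolve("local-" + i + ".sst")));
        }
        BatchCopier copier = new DelegatingCopier();
        copier.copyFiles(requests);
        System.out.println("copied " + requests.size() + " files into " + dir);
    }
}
```

In this sketch a downloader would depend only on BatchCopier, so a file system
without native support could fall back to the stream-based path unchanged.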
For S3, there are at least two potential implementations. The first is to use
AWS SDKv2 directly (Flink currently uses AWS SDKv1 wrapped by hadoop/presto)
together with Amazon S3 Transfer Manager. The second option is to use a 3rd
party tool called s5cmd. It is claimed to be a faster alternative to the
official AWS clients, which was confirmed by our benchmarks.
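
As a rough illustration of the s5cmd option, the sketch below renders a batch
of downloads as one "cp" line per file, which is the kind of input the s5cmd
"run" command reads from a file. It only builds the text and the command line;
it does not launch a process. S5cmdBatchSketch, Transfer and the bucket and
path names are hypothetical, the binary is assumed to be on the PATH, and
paths containing whitespace are rejected rather than quoted to keep the sketch
simple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the implementation proposed in the FLIP. */
final class S5cmdBatchSketch {

    /** Hypothetical pair of an s3:// URI and a local destination path. */
    static final class Transfer {
        final String remoteUri;
        final String localPath;

        Transfer(String remoteUri, String localPath) {
            this.remoteUri = requireNoWhitespace(remoteUri);
            this.localPath = requireNoWhitespace(localPath);
        }
    }

    /** Assumption of this sketch: arguments are passed unquoted. */
    private static String requireNoWhitespace(String value) {
        if (value.chars().anyMatch(Character::isWhitespace)) {
            throw new IllegalArgumentException("whitespace not supported: " + value);
        }
        return value;
    }

    /** Renders one "cp <source> <destination>" line per transfer. */
    static String toCommandFile(List<Transfer> transfers) {
        StringBuilder sb = new StringBuilder();
        for (Transfer t : transfers) {
            sb.append("cp ").append(t.remoteUri).append(' ').append(t.localPath).append('\n');
        }
        return sb.toString();
    }

    /** Command line that would process the whole batch in a single s5cmd process. */
    static List<String> toProcessCommand(String commandFilePath) {
        return Arrays.asList("s5cmd", "run", commandFilePath);
    }

    public static void main(String[] args) {
        List<Transfer> transfers = new ArrayList<>();
        transfers.add(new Transfer("s3://bucket/state/000001.sst", "/tmp/state/000001.sst"));
        transfers.add(new Transfer("s3://bucket/state/000002.sst", "/tmp/state/000002.sst"));
        System.out.print(toCommandFile(transfers));
        System.out.println(String.join(" ", toProcessCommand("/tmp/state/commands.txt")));
    }
}
```

Batching all files into one invocation is what avoids the per-file overhead
described earlier; how credentials and error handling would be wired in is
left out of this sketch.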
--
This message was sent by Atlassian Jira
(v8.20.10#820010)