[
https://issues.apache.org/jira/browse/FLINK-35704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860159#comment-17860159
]
Grzegorz Liter commented on FLINK-35704:
----------------------------------------
Pull request: https://github.com/apache/flink/pull/24986
> ForkJoinPool introduction to NonSplittingRecursiveEnumerator to vastly
> improve enumeration performance
> ------------------------------------------------------------------------------------------------------
>
> Key: FLINK-35704
> URL: https://issues.apache.org/jira/browse/FLINK-35704
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / FileSystem
> Reporter: Grzegorz Liter
> Priority: Minor
> Attachments: ParallelNonSplittingRecursiveEnumerator.java
>
>
> In the current implementation of NonSplittingRecursiveEnumerator, files and
> directories are enumerated sequentially. When accessing remote storage
> such as S3, the vast majority of the time is spent waiting for responses.
> Worse, the enumeration is performed by the JobManager (JM) itself, which is
> unresponsive to RPC calls while it runs. When accessing many (thousands+)
> files, the wait time quickly adds up and can cause a Pekko timeout.
> Performance can be improved by enumerating files in parallel, e.g. with a
> ForkJoinPool and parallel streams. I am attaching an example implementation
> that I am happy to contribute to the Flink repository.
> In my tests it cuts the enumeration time by at least 10x.
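
For illustration only, the parallel enumeration idea described above could look roughly like the sketch below. This is not the attached ParallelNonSplittingRecursiveEnumerator.java and not Flink code: it assumes a local directory tree accessed through java.nio.file instead of Flink's FileSystem API, and the names ParallelEnumerationSketch, ListTask and enumerate are hypothetical. Each directory listing runs as a task in a ForkJoinPool, so listings of sibling directories can overlap instead of waiting on each other.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: recursive file enumeration on a ForkJoinPool.
// Uses java.nio.file as a stand-in for a remote file system client.
final class ParallelEnumerationSketch {

    /** Lists one directory and forks a subtask for every subdirectory. */
    static final class ListTask extends RecursiveTask<List<Path>> {
        private final Path dir;

        ListTask(Path dir) {
            this.dir = dir;
        }

        @Override
        protected List<Path> compute() {
            List<Path> entries;
            // Files.list returns a lazily populated stream that must be closed.
            try (Stream<Path> children = Files.list(dir)) {
                entries = children.collect(Collectors.toList());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }

            List<Path> files = new ArrayList<>();
            List<ListTask> subtasks = new ArrayList<>();
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    ListTask task = new ListTask(entry);
                    task.fork(); // schedule the subdirectory listing asynchronously
                    subtasks.add(task);
                } else {
                    files.add(entry);
                }
            }
            // Collect the results of all forked subdirectory listings.
            for (ListTask task : subtasks) {
                files.addAll(task.join());
            }
            return files;
        }
    }

    /** Enumerates all regular files under root using a dedicated pool. */
    static List<Path> enumerate(Path root, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ListTask(root));
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as ParallelEnumerationSketch.enumerate(root, 32) returns all files under root in no particular order. Because the tasks block on I/O rather than on CPU, the useful parallelism is likely higher than the number of cores; the right value for a store like S3 is an open tuning question and depends on request limits.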
--
This message was sent by Atlassian Jira
(v8.20.10#820010)