[ 
https://issues.apache.org/jira/browse/FLINK-35704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860159#comment-17860159
 ] 

Grzegorz Liter commented on FLINK-35704:
----------------------------------------

Pull request: https://github.com/apache/flink/pull/24986


> ForkJoinPool introduction to NonSplittingRecursiveEnumerator to vastly 
> improve enumeration performance
> ------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-35704
>                 URL: https://issues.apache.org/jira/browse/FLINK-35704
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>            Reporter: Grzegorz Liter
>            Priority: Minor
>         Attachments: ParallelNonSplittingRecursiveEnumerator.java
>
>
> In the current implementation of NonSplittingRecursiveEnumerator, files and 
> directories are enumerated sequentially. When accessing remote storage such 
> as S3, a vast amount of time is wasted waiting for responses.
> What is worse, the enumeration is done by the JM itself, which is 
> unresponsive to RPC calls while it runs. When accessing many (thousands+) 
> files, the wait time quickly adds up and can cause a Pekko timeout.
> Performance can be improved by enumerating files in parallel, e.g. with a 
> ForkJoinPool and parallel streams. I am attaching an example implementation 
> that I am happy to contribute to the Flink repository.
> In my tests it cuts the enumeration time by at least 10x.
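
For illustration only, the idea in the description (forking one task per directory instead of walking the tree sequentially) might look roughly like the sketch below. This is not the attached ParallelNonSplittingRecursiveEnumerator.java and does not use Flink's FileSystem API; it assumes a local file tree read through java.nio.file, and the names ParallelEnumerationSketch, ListDirTask and enumerate are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: parallel recursive listing with a ForkJoinPool.
// Uses the local file system, not Flink's FileSystem abstraction.
final class ParallelEnumerationSketch {

    /** Lists one directory and forks a subtask for every subdirectory. */
    static final class ListDirTask extends RecursiveTask<List<Path>> {
        private final Path dir;

        ListDirTask(Path dir) {
            this.dir = dir;
        }

        @Override
        protected List<Path> compute() {
            List<Path> files = new ArrayList<>();
            List<ListDirTask> subTasks = new ArrayList<>();
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    if (Files.isDirectory(entry)) {
                        // Subdirectories are listed concurrently by other workers.
                        ListDirTask task = new ListDirTask(entry);
                        task.fork();
                        subTasks.add(task);
                    } else {
                        files.add(entry);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // Collect the results of all forked subtasks.
            for (ListDirTask task : subTasks) {
                files.addAll(task.join());
            }
            return files;
        }
    }

    /** Enumerates all regular files under root using a dedicated pool. */
    static List<Path> enumerate(Path root, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ListDirTask(root));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny temporary tree and enumerate it.
        Path root = Files.createTempDirectory("enum-sketch");
        Files.createFile(root.resolve("a.txt"));
        Path sub = Files.createDirectory(root.resolve("sub"));
        Files.createFile(sub.resolve("b.txt"));
        System.out.println(enumerate(root, 4));
    }
}
```

Because each listing call blocks on I/O, the pool's parallelism caps how many requests are in flight at once; for remote storage it would presumably be chosen by request latency rather than by CPU count.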



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
