[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16474674#comment-16474674 ]
Ajay Sachdev commented on HDFS-13398:
-------------------------------------
Hi Rushabh,
That's a good question. As you may know, the ForkJoinPool (FJP) framework is
very important on multi-core systems: it forks a large task into smaller
subtasks and then joins (waits for) the individual results to form the final
result (a divide-and-conquer approach). We tried different parallelism levels
(i.e. number of threads) in the ForkJoinPool framework: 8, 16, 32 and 64. The
use case was a hierarchy of 40K+ directories/files with a nested tree
structure. In this scenario we were able to demonstrate that numberOfThreads=32
was the ideal configuration, so our tests were based on these variables.
FsShell commands (single-threaded approach):
ls -R -> 12 mins
du    -> 14 mins
count -> 14 mins
FsShell commands (FJP approach):
ls -R -> 3 mins
du    -> 2 mins
count -> 9 mins
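
For readers unfamiliar with the fork/join pattern described above, here is a
minimal sketch of a recursive listing built on ForkJoinPool. It is not the code
from the attached patches. To stay self-contained it walks a local directory
tree with java.nio.file instead of the Hadoop FileSystem API, and the names
ParallelLister, ListTask and listRecursive are made up for illustration. The
parallelism of 32 is the value reported above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical illustration; local filesystem stands in for HCFS.
public class ParallelLister {

    /** Lists one directory and forks one subtask per subdirectory. */
    static class ListTask extends RecursiveTask<List<Path>> {
        private final Path dir;

        ListTask(Path dir) {
            this.dir = dir;
        }

        @Override
        protected List<Path> compute() {
            List<Path> found = new ArrayList<>();
            List<ListTask> subtasks = new ArrayList<>();
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    found.add(entry);
                    // Do not follow symlinks, to avoid cycles in the tree.
                    if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                        ListTask task = new ListTask(entry);
                        task.fork();          // fork: schedule the subdirectory
                        subtasks.add(task);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            for (ListTask task : subtasks) {
                found.addAll(task.join());    // join: wait for and merge results
            }
            return found;
        }
    }

    /** Returns every entry below root, using a pool of the given parallelism. */
    static List<Path> listRecursive(Path root, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ListTask(root));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> all = listRecursive(root, 32);
        System.out.println(all.size() + " entries under " + root);
    }
}
```

Each directory listing is an independent blocking call, so several listings can
be in flight at once while a joining worker picks up other forked tasks. This is
the likely reason a pool larger than the core count can still help against a
remote object store.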
> Hdfs recursive listing operation is very slow
> ---------------------------------------------
>
> Key: HDFS-13398
> URL: https://issues.apache.org/jira/browse/HDFS-13398
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 2.7.1
> Environment: HCFS file system where HDP 2.6.1 is connected to ECS
> (Object Store).
> Reporter: Ajay Sachdev
> Assignee: Ajay Sachdev
> Priority: Major
> Fix For: 2.7.1
>
> Attachments: HDFS-13398.001.patch, HDFS-13398.002.patch,
> parallelfsPatch
>
>
> The hdfs dfs -ls -R command is sequential in nature and is very slow on an
> HCFS system. We have seen around 6 mins for a structure of 40K
> directories/files. The proposal is to use a multithreaded approach to speed up
> the recursive list, du and count operations.
> We have tried a ForkJoinPool implementation to improve the performance of the
> recursive listing operation.
> [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli]
> commit id :
> 82387c8cd76c2e2761bd7f651122f83d45ae8876
> Another implementation uses the Java ExecutorService to run the listing
> operation in multiple threads in parallel. This has significantly reduced the
> time from 6 mins to 40 secs.
>
>
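
The ExecutorService alternative mentioned in the issue description could look
roughly like the sketch below. It is an assumption about the general shape, not
the attached implementation. It again uses java.nio.file on a local tree as a
stand-in for the Hadoop FileSystem API, and the names ExecutorLister and
listRecursive are hypothetical. Tasks never block on each other; a counter of
outstanding directories tells the caller when the walk is complete.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; local filesystem stands in for HCFS.
public class ExecutorLister {
    private final ExecutorService pool;
    private final Queue<Path> found = new ConcurrentLinkedQueue<>();
    private final AtomicInteger pending = new AtomicInteger(); // directories not yet listed
    private final CountDownLatch done = new CountDownLatch(1);

    private ExecutorLister(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Returns every entry below root, listing directories on a fixed pool. */
    static List<Path> listRecursive(Path root, int threads) throws InterruptedException {
        ExecutorLister lister = new ExecutorLister(threads);
        try {
            lister.submit(root);
            lister.done.await();   // released when the last directory is listed
        } finally {
            lister.pool.shutdown();
        }
        return new ArrayList<>(lister.found);
    }

    private void submit(Path dir) {
        // Count the directory before queuing it, so the counter cannot reach
        // zero while a parent is still discovering children.
        pending.incrementAndGet();
        pool.execute(() -> {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    found.add(entry);
                    if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                        submit(entry);
                    }
                }
            } catch (IOException | DirectoryIteratorException e) {
                // Sketch only: report the unreadable directory and continue.
                System.err.println("cannot list " + dir + ": " + e);
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        Path root = java.nio.file.Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(listRecursive(root, 32).size() + " entries under " + root);
    }
}
```

Unlike the fork/join version, the results arrive in no particular order, so a
real ls -R replacement would still need to sort or group them before printing.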