[
https://issues.apache.org/jira/browse/HADOOP-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16478429#comment-16478429
]
Mukul Kumar Singh commented on HADOOP-15471:
--------------------------------------------
Thanks for the [^HDFS-13398.002.patch] [~ajaysachdev]. Some major comments on
the patch:
1) The current recursion in Command@recursePath uses the depth variable to
recurse down the tree. Now that multiple threads are involved, I feel that
modifications to this variable should be synchronized and kept local.
2) Apache Hadoop uses 2 spaces for indentation. Please follow the same coding
guidelines in the patch.
3) Let's also add a unit test for this patch. The test can parse a multi-level
directory structure through both the current method and the new method in the
patch, then compare the results to verify the validity of the patch.
4) Also, I feel that in place of the config variable "fs.threads", the number
of threads should be a command-line argument, so that the user can control the
number of threads for each invocation of the command.
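
To illustrate points 1, 3 and 4, here is a minimal sketch and not the code from
the patch. It assumes a ForkJoinPool-based walk over the local file system via
java.nio.file instead of Hadoop's FileSystem API. All names in it
(ParallelListingSketch, ListTask, listSequential, listParallel) are
hypothetical. Each task carries its own immutable depth, so there is no shared
counter to synchronize. The thread count comes from the command line, and the
parallel result is compared with a sequential walk, as a unit test might do.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class ParallelListingSketch {

  /** Sequential reference walk: maps every path below root to its depth. */
  static SortedMap<String, Integer> listSequential(Path root) {
    SortedMap<String, Integer> out = new TreeMap<>();
    walk(root, 0, out);
    return out;
  }

  private static void walk(Path dir, int depth, SortedMap<String, Integer> out) {
    for (Path child : children(dir)) {
      out.put(child.toString(), depth + 1);
      if (Files.isDirectory(child, LinkOption.NOFOLLOW_LINKS)) {
        walk(child, depth + 1, out);
      }
    }
  }

  /** Parallel walk; the caller chooses the number of threads per invocation. */
  static SortedMap<String, Integer> listParallel(Path root, int threads) {
    ForkJoinPool pool = new ForkJoinPool(threads);
    try {
      return pool.invoke(new ListTask(root, 0));
    } finally {
      pool.shutdown();
    }
  }

  /** One task per directory. The depth is a final field, never shared state. */
  private static final class ListTask
      extends RecursiveTask<SortedMap<String, Integer>> {
    private final Path dir;
    private final int depth;

    ListTask(Path dir, int depth) {
      this.dir = dir;
      this.depth = depth;
    }

    @Override
    protected SortedMap<String, Integer> compute() {
      SortedMap<String, Integer> out = new TreeMap<>();
      List<ListTask> subtasks = new ArrayList<>();
      for (Path child : children(dir)) {
        out.put(child.toString(), depth + 1);
        if (Files.isDirectory(child, LinkOption.NOFOLLOW_LINKS)) {
          ListTask task = new ListTask(child, depth + 1);
          task.fork(); // list the subdirectory asynchronously
          subtasks.add(task);
        }
      }
      for (ListTask task : subtasks) {
        out.putAll(task.join()); // merge each subtree's private result
      }
      return out;
    }
  }

  private static List<Path> children(Path dir) {
    List<Path> result = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path p : stream) {
        result.add(p);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }

  public static void main(String[] args) {
    Path root = Paths.get(args.length > 0 ? args[0] : ".");
    // Thread count is an argument of this invocation, not a config key.
    int threads = args.length > 1 ? Integer.parseInt(args[1]) : 4;
    boolean same = listSequential(root).equals(listParallel(root, threads));
    System.out.println("listings match: " + same);
  }
}
```

Running it as "java ParallelListingSketch <dir> <threads>" performs the same
comparison that the unit test in point 3 would assert on.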
> Hdfs recursive listing operation is very slow
> ---------------------------------------------
>
> Key: HADOOP-15471
> URL: https://issues.apache.org/jira/browse/HADOOP-15471
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Affects Versions: 2.7.1
> Environment: HCFS file system where HDP 2.6.1 is connected to ECS
> (Object Store).
> Reporter: Ajay Sachdev
> Assignee: Ajay Sachdev
> Priority: Major
> Fix For: 2.7.1
>
> Attachments: HDFS-13398.001.patch, HDFS-13398.002.patch,
> parallelfsPatch
>
>
> The hdfs dfs -ls -R command is sequential in nature and is very slow for an
> HCFS system. We have seen around 6 mins for a structure of 40K
> directories/files.
> The proposal is to use a multithreading approach to speed up recursive list,
> du and count operations.
> We have tried a ForkJoinPool implementation to improve performance for the
> recursive listing operation.
> [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli]
> commit id :
> 82387c8cd76c2e2761bd7f651122f83d45ae8876
> Another implementation uses the Java Executor Service to run the listing
> operation in multiple threads in parallel. This has significantly reduced the
> time from 6 mins to 40 secs.
>
>