[ 
https://issues.apache.org/jira/browse/HADOOP-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16478429#comment-16478429
 ] 

Mukul Kumar Singh commented on HADOOP-15471:
--------------------------------------------

Thanks for the [^HDFS-13398.002.patch], [~ajaysachdev]. Some major comments on 
the patch:

1) The current recursion in Command#recursePath uses the depth variable to 
recurse down the tree. Since the patch runs the recursion on multiple threads, 
I feel we should synchronize modifications to this variable and keep them 
localized.
2) Apache Hadoop uses 2 spaces for indentation. Please follow the same coding 
guidelines in the patch.
3) Let's also add a unit test for this patch. The test can traverse a 
multi-level directory structure with both the current method and the new 
method in the patch, and compare the results to verify the validity of the 
patch.
4) Also, I feel that instead of a config variable "fs.threads", the number of 
threads should be a command-line argument, so that the user can control the 
number of threads for each invocation of the command.
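
To make points 1, 3 and 4 concrete, here is a rough sketch. It is not the code 
from the patch and it does not use the Hadoop FileSystem API: it walks a local 
directory with java.nio.file as a stand-in, and the class and method names 
(ParallelListingSketch, listSequential, listParallel, ListTask) are 
hypothetical. The idea is that depth is passed down as a parameter or an 
immutable task field instead of being a shared mutable field, the thread count 
is supplied per invocation, and the sequential and parallel results can be 
compared directly in a test.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical sketch; a local directory stands in for the HCFS file system.
final class ParallelListingSketch {

  /** Sequential baseline: depth is an argument, so no shared state is mutated. */
  static void listSequential(Path dir, int depth, Map<Path, Integer> out) {
    for (Path child : children(dir)) {
      out.put(child, depth);
      if (Files.isDirectory(child, LinkOption.NOFOLLOW_LINKS)) {
        listSequential(child, depth + 1, out);
      }
    }
  }

  /** One task per directory; each task carries its own immutable depth. */
  static final class ListTask extends RecursiveAction {
    private final Path dir;
    private final int depth;
    private final Map<Path, Integer> out;

    ListTask(Path dir, int depth, Map<Path, Integer> out) {
      this.dir = dir;
      this.depth = depth;
      this.out = out;
    }

    @Override
    protected void compute() {
      List<ListTask> subtasks = new ArrayList<>();
      for (Path child : children(dir)) {
        out.put(child, depth);
        if (Files.isDirectory(child, LinkOption.NOFOLLOW_LINKS)) {
          subtasks.add(new ListTask(child, depth + 1, out));
        }
      }
      invokeAll(subtasks); // run subdirectories in parallel and wait for them
    }
  }

  /** Parallel listing; numThreads is chosen by the caller for each invocation. */
  static Map<Path, Integer> listParallel(Path root, int numThreads) {
    Map<Path, Integer> out = new ConcurrentHashMap<>();
    ForkJoinPool pool = new ForkJoinPool(numThreads);
    try {
      pool.invoke(new ListTask(root, 0, out));
    } finally {
      pool.shutdown();
    }
    return out;
  }

  /** Reads the direct children of a directory into a list. */
  private static List<Path> children(Path dir) {
    List<Path> result = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path p : stream) {
        result.add(p);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }

  /** Usage: java ParallelListingSketch [numThreads] [rootDir] */
  public static void main(String[] args) {
    int numThreads = args.length > 0 ? Integer.parseInt(args[0]) : 4;
    Path root = Paths.get(args.length > 1 ? args[1] : ".");
    Map<Path, Integer> sequential = new HashMap<>();
    listSequential(root, 0, sequential);
    Map<Path, Integer> parallel = listParallel(root, numThreads);
    System.out.println("entries: " + parallel.size()
        + ", results match: " + sequential.equals(parallel));
  }
}
```

A unit test along the lines of point 3 could build a small multi-level 
directory tree, run both methods, and assert that the two maps are equal. 
Taking the thread count as a method argument mirrors point 4; how it would be 
wired into the shell command's option parsing is left open here.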

> Hdfs recursive listing operation is very slow
> ---------------------------------------------
>
>                 Key: HADOOP-15471
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15471
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.7.1
>         Environment: HCFS file system where HDP 2.6.1 is connected to ECS 
> (Object Store).
>            Reporter: Ajay Sachdev
>            Assignee: Ajay Sachdev
>            Priority: Major
>             Fix For: 2.7.1
>
>         Attachments: HDFS-13398.001.patch, HDFS-13398.002.patch, 
> parallelfsPatch
>
>
> The hdfs dfs -ls -R command is sequential in nature and is very slow for an 
> HCFS system. We have seen it take around 6 minutes for a structure of 40K 
> directories/files.
> The proposal is to use a multithreaded approach to speed up recursive list, 
> du and count operations.
> We have tried a ForkJoinPool implementation to improve performance for the 
> recursive listing operation.
> [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli]
> commit id : 
> 82387c8cd76c2e2761bd7f651122f83d45ae8876
> Another implementation uses the Java ExecutorService to run the listing 
> operation in multiple threads in parallel. This has significantly reduced 
> the time from 6 minutes to 40 seconds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
