[jira] [Commented] (HDFS-13398) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474674#comment-16474674 ] Ajay Sachdev commented on HDFS-13398: - Hi Rushabh, Thats a good question. As you may know FJP is very critical for multi-core systems and it forks large tasks into smaller subtasks and then joins (waits) individual results to form final result (Divide and Conquer approach). We tried different number for parallelism (ie number of threads) in ForkJoinPool framework such as 8, 16, 32 and 64. The use case was a directory/file hierarchy structure of 40K+ and a nested tree-level. In this scenario we were able to demonstrate that a numberOfThreads=32 was ideal configuration. So our tests were based off these variables. FsShell commands (single threaded approach) - ls -R -> 12 mins du -> 14 mins count -> 14 mins FsShell commands (FJP approach) - ls -R -> 3 mins du -> 2 mins count -> 9 mins > Hdfs recursive listing operation is very slow > - > > Key: HDFS-13398 > URL: https://issues.apache.org/jira/browse/HDFS-13398 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.7.1 > Environment: HCFS file system where HDP 2.6.1 is connected to ECS > (Object Store). >Reporter: Ajay Sachdev >Assignee: Ajay Sachdev >Priority: Major > Fix For: 2.7.1 > > Attachments: HDFS-13398.001.patch, HDFS-13398.002.patch, > parallelfsPatch > > > The hdfs dfs -ls -R command is sequential in nature and is very slow for a > HCFS system. We have seen around 6 mins for 40K directory/files structure. > The proposal is to use multithreading approach to speed up recursive list, du > and count operations. > We have tried a ForkJoinPool implementation to improve performance for > recursive listing operation. > [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli] > commit id : > 82387c8cd76c2e2761bd7f651122f83d45ae8876 > Another implementation is to use Java Executor Service to improve performance > to run listing operation in multiple threads in parallel. This has > significantly reduced the time to 40 secs from 6 mins. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13398) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474335#comment-16474335 ] Ajay Sachdev commented on HDFS-13398: - Hi Yiqun, Thanks for taking the time to take a look at the patch. I have also uploaded the 2nd version of the patch against apache trunk. I would appreciate if you could provide some feedback against that version as well. This patch is intended for our customer which is using ViPRFS interface and they only want to run -ls -R/-du and count commands of FsShell. I agree that if there is any mutable shared state that any of sub-commands such as usagesTable in FsUsage.java class then we will wrap synchronized block around it to make it thread-safe. As a long term solution we would like to optimize all FsShell commands to take use of this ForkJoin Pool Framework and then all sub-commands implementation can ensure that any shared state are thread-safe. Thanks Ajay > Hdfs recursive listing operation is very slow > - > > Key: HDFS-13398 > URL: https://issues.apache.org/jira/browse/HDFS-13398 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.7.1 > Environment: HCFS file system where HDP 2.6.1 is connected to ECS > (Object Store). >Reporter: Ajay Sachdev >Assignee: Ajay Sachdev >Priority: Major > Fix For: 2.7.1 > > Attachments: HDFS-13398.001.patch, HDFS-13398.002.patch, > parallelfsPatch > > > The hdfs dfs -ls -R command is sequential in nature and is very slow for a > HCFS system. We have seen around 6 mins for 40K directory/files structure. > The proposal is to use multithreading approach to speed up recursive list, du > and count operations. > We have tried a ForkJoinPool implementation to improve performance for > recursive listing operation. > [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli] > commit id : > 82387c8cd76c2e2761bd7f651122f83d45ae8876 > Another implementation is to use Java Executor Service to improve performance > to run listing operation in multiple threads in parallel. This has > significantly reduced the time to 40 secs from 6 mins. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13398) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474147#comment-16474147 ] Ajay Sachdev commented on HDFS-13398: - Hi Mukul, I have attached the new patch to JIRA. Cheers, Ajay > Hdfs recursive listing operation is very slow > - > > Key: HDFS-13398 > URL: https://issues.apache.org/jira/browse/HDFS-13398 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.7.1 > Environment: HCFS file system where HDP 2.6.1 is connected to ECS > (Object Store). >Reporter: Ajay Sachdev >Assignee: Ajay Sachdev >Priority: Major > Fix For: 2.7.1 > > Attachments: HDFS-13398.001.patch, HDFS-13398.002.patch, > parallelfsPatch > > > The hdfs dfs -ls -R command is sequential in nature and is very slow for a > HCFS system. We have seen around 6 mins for 40K directory/files structure. > The proposal is to use multithreading approach to speed up recursive list, du > and count operations. > We have tried a ForkJoinPool implementation to improve performance for > recursive listing operation. > [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli] > commit id : > 82387c8cd76c2e2761bd7f651122f83d45ae8876 > Another implementation is to use Java Executor Service to improve performance > to run listing operation in multiple threads in parallel. This has > significantly reduced the time to 40 secs from 6 mins. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13398) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajay Sachdev updated HDFS-13398: Attachment: HDFS-13398.002.patch > Hdfs recursive listing operation is very slow > - > > Key: HDFS-13398 > URL: https://issues.apache.org/jira/browse/HDFS-13398 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.7.1 > Environment: HCFS file system where HDP 2.6.1 is connected to ECS > (Object Store). >Reporter: Ajay Sachdev >Assignee: Ajay Sachdev >Priority: Major > Fix For: 2.7.1 > > Attachments: HDFS-13398.001.patch, HDFS-13398.002.patch, > parallelfsPatch > > > The hdfs dfs -ls -R command is sequential in nature and is very slow for a > HCFS system. We have seen around 6 mins for 40K directory/files structure. > The proposal is to use multithreading approach to speed up recursive list, du > and count operations. > We have tried a ForkJoinPool implementation to improve performance for > recursive listing operation. > [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli] > commit id : > 82387c8cd76c2e2761bd7f651122f83d45ae8876 > Another implementation is to use Java Executor Service to improve performance > to run listing operation in multiple threads in parallel. This has > significantly reduced the time to 40 secs from 6 mins. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Issue Comment Deleted] (HDFS-13398) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajay Sachdev updated HDFS-13398: Comment: was deleted (was: HDFS-13398.002.patch) > Hdfs recursive listing operation is very slow > - > > Key: HDFS-13398 > URL: https://issues.apache.org/jira/browse/HDFS-13398 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.7.1 > Environment: HCFS file system where HDP 2.6.1 is connected to ECS > (Object Store). >Reporter: Ajay Sachdev >Assignee: Ajay Sachdev >Priority: Major > Fix For: 2.7.1 > > Attachments: HDFS-13398.001.patch, parallelfsPatch > > > The hdfs dfs -ls -R command is sequential in nature and is very slow for a > HCFS system. We have seen around 6 mins for 40K directory/files structure. > The proposal is to use multithreading approach to speed up recursive list, du > and count operations. > We have tried a ForkJoinPool implementation to improve performance for > recursive listing operation. > [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli] > commit id : > 82387c8cd76c2e2761bd7f651122f83d45ae8876 > Another implementation is to use Java Executor Service to improve performance > to run listing operation in multiple threads in parallel. This has > significantly reduced the time to 40 secs from 6 mins. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13398) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473104#comment-16473104 ] Ajay Sachdev commented on HDFS-13398: - HDFS-13398.002.patch > Hdfs recursive listing operation is very slow > - > > Key: HDFS-13398 > URL: https://issues.apache.org/jira/browse/HDFS-13398 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.7.1 > Environment: HCFS file system where HDP 2.6.1 is connected to ECS > (Object Store). >Reporter: Ajay Sachdev >Assignee: Ajay Sachdev >Priority: Major > Fix For: 2.7.1 > > Attachments: HDFS-13398.001.patch, parallelfsPatch > > > The hdfs dfs -ls -R command is sequential in nature and is very slow for a > HCFS system. We have seen around 6 mins for 40K directory/files structure. > The proposal is to use multithreading approach to speed up recursive list, du > and count operations. > We have tried a ForkJoinPool implementation to improve performance for > recursive listing operation. > [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli] > commit id : > 82387c8cd76c2e2761bd7f651122f83d45ae8876 > Another implementation is to use Java Executor Service to improve performance > to run listing operation in multiple threads in parallel. This has > significantly reduced the time to 40 secs from 6 mins. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-13398) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472582#comment-16472582 ] Ajay Sachdev edited comment on HDFS-13398 at 5/11/18 8:14 PM: -- I have attached apache hadoop trunk multithreading diff. Please take a look at code review and provide comments. Thanks Ajay was (Author: ajaynmalu23): I have attached apache hadoop trunk multithreading diff. Please take a look at code review and provide comments. Thanks Ajay > Hdfs recursive listing operation is very slow > - > > Key: HDFS-13398 > URL: https://issues.apache.org/jira/browse/HDFS-13398 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.7.1 > Environment: HCFS file system where HDP 2.6.1 is connected to ECS > (Object Store). >Reporter: Ajay Sachdev >Assignee: Ajay Sachdev >Priority: Major > Fix For: 2.7.1 > > Attachments: HDFS-13398.001.patch, parallelfsPatch > > > The hdfs dfs -ls -R command is sequential in nature and is very slow for a > HCFS system. We have seen around 6 mins for 40K directory/files structure. > The proposal is to use multithreading approach to speed up recursive list, du > and count operations. > We have tried a ForkJoinPool implementation to improve performance for > recursive listing operation. > [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli] > commit id : > 82387c8cd76c2e2761bd7f651122f83d45ae8876 > Another implementation is to use Java Executor Service to improve performance > to run listing operation in multiple threads in parallel. This has > significantly reduced the time to 40 secs from 6 mins. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13398) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472582#comment-16472582 ] Ajay Sachdev commented on HDFS-13398: - I have attached apache hadoop trunk multithreading diff. Please take a look at code review and provide comments. Thanks Ajay > Hdfs recursive listing operation is very slow > - > > Key: HDFS-13398 > URL: https://issues.apache.org/jira/browse/HDFS-13398 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.7.1 > Environment: HCFS file system where HDP 2.6.1 is connected to ECS > (Object Store). >Reporter: Ajay Sachdev >Assignee: Ajay Sachdev >Priority: Major > Fix For: 2.7.1 > > Attachments: HDFS-13398.001.patch, parallelfsPatch > > > The hdfs dfs -ls -R command is sequential in nature and is very slow for a > HCFS system. We have seen around 6 mins for 40K directory/files structure. > The proposal is to use multithreading approach to speed up recursive list, du > and count operations. > We have tried a ForkJoinPool implementation to improve performance for > recursive listing operation. > [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli] > commit id : > 82387c8cd76c2e2761bd7f651122f83d45ae8876 > Another implementation is to use Java Executor Service to improve performance > to run listing operation in multiple threads in parallel. This has > significantly reduced the time to 40 secs from 6 mins. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13398) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajay Sachdev updated HDFS-13398: Attachment: HDFS-13398.001.patch > Hdfs recursive listing operation is very slow > - > > Key: HDFS-13398 > URL: https://issues.apache.org/jira/browse/HDFS-13398 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.7.1 > Environment: HCFS file system where HDP 2.6.1 is connected to ECS > (Object Store). >Reporter: Ajay Sachdev >Assignee: Ajay Sachdev >Priority: Major > Fix For: 2.7.1 > > Attachments: HDFS-13398.001.patch, parallelfsPatch > > > The hdfs dfs -ls -R command is sequential in nature and is very slow for a > HCFS system. We have seen around 6 mins for 40K directory/files structure. > The proposal is to use multithreading approach to speed up recursive list, du > and count operations. > We have tried a ForkJoinPool implementation to improve performance for > recursive listing operation. > [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli] > commit id : > 82387c8cd76c2e2761bd7f651122f83d45ae8876 > Another implementation is to use Java Executor Service to improve performance > to run listing operation in multiple threads in parallel. This has > significantly reduced the time to 40 secs from 6 mins. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13398) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461122#comment-16461122 ] Ajay Sachdev commented on HDFS-13398: - Hi Mukul/Arpit, Sorry for late response. We would like to get a formal review process for this patch before we deploy on customer site. I will work on items #3 and #4 above. Also I was unable to find trunk branch in [https://github.com/hortonworks/hadoop-release.] So we have used HDP tag - HDP-2.6.2.0-205-tag I have also attached the patch against this tag. I would appreciate if you could take a look at code and get a review as well. Thanks Ajay[^HDFS-13398.001.patch] > Hdfs recursive listing operation is very slow > - > > Key: HDFS-13398 > URL: https://issues.apache.org/jira/browse/HDFS-13398 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.7.1 > Environment: HCFS file system where HDP 2.6.1 is connected to ECS > (Object Store). >Reporter: Ajay Sachdev >Assignee: Ajay Sachdev >Priority: Major > Fix For: 2.7.1 > > Attachments: HDFS-13398.001.patch, parallelfsPatch > > > The hdfs dfs -ls -R command is sequential in nature and is very slow for a > HCFS system. We have seen around 6 mins for 40K directory/files structure. > The proposal is to use multithreading approach to speed up recursive list, du > and count operations. > We have tried a ForkJoinPool implementation to improve performance for > recursive listing operation. > [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli] > commit id : > 82387c8cd76c2e2761bd7f651122f83d45ae8876 > Another implementation is to use Java Executor Service to improve performance > to run listing operation in multiple threads in parallel. This has > significantly reduced the time to 40 secs from 6 mins. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13398) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajay Sachdev updated HDFS-13398: Attachment: parallelfsPatch > Hdfs recursive listing operation is very slow > - > > Key: HDFS-13398 > URL: https://issues.apache.org/jira/browse/HDFS-13398 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.7.1 > Environment: HCFS file system where HDP 2.6.1 is connected to ECS > (Object Store). >Reporter: Ajay Sachdev >Priority: Major > Fix For: 2.7.1 > > Attachments: parallelfsPatch > > > The hdfs dfs -ls -R command is sequential in nature and is very slow for a > HCFS system. We have seen around 6 mins for 40K directory/files structure. > The proposal is to use multithreading approach to speed up recursive list, du > and count operations. > We have tried a ForkJoinPool implementation to improve performance for > recursive listing operation. > [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli] > commit id : > 82387c8cd76c2e2761bd7f651122f83d45ae8876 > Another implementation is to use Java Executor Service to improve performance > to run listing operation in multiple threads in parallel. This has > significantly reduced the time to 40 secs from 6 mins. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-13398) Hdfs recursive listing operation is very slow
Ajay Sachdev created HDFS-13398: --- Summary: Hdfs recursive listing operation is very slow Key: HDFS-13398 URL: https://issues.apache.org/jira/browse/HDFS-13398 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 2.7.1 Environment: HCFS file system where HDP 2.6.1 is connected to ECS (Object Store). Reporter: Ajay Sachdev Fix For: 2.7.1 The hdfs dfs -ls -R command is sequential in nature and is very slow for a HCFS system. We have seen around 6 mins for 40K directory/files structure. The proposal is to use multithreading approach to speed up recursive list, du and count operations. We have tried a ForkJoinPool implementation to improve performance for recursive listing operation. [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli] commit id : 82387c8cd76c2e2761bd7f651122f83d45ae8876 Another implementation is to use Java Executor Service to improve performance to run listing operation in multiple threads in parallel. This has significantly reduced the time to 40 secs from 6 mins. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org