[ https://issues.apache.org/jira/browse/HDFS-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15417660#comment-15417660 ]
Anatoli Shein commented on HDFS-10679:
--------------------------------------

Also, I just ran a test using "/usr/bin/time -v" to measure memory consumption on the following directory structure:
root dir / 10 dirs / 1500 dirs each (15011 directories total)

Our output:

    Command being timed: "find hdfs://localhost.localdomain:9433/ * 1"
    User time (seconds): 0.33
    System time (seconds): 0.11
    Percent of CPU this job got: 54%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.82
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 17948
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 2835
    Voluntary context switches: 4297
    Involuntary context switches: 27
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

Java Hadoop output:

    Command being timed: "hadoop fs -ls -R hdfs://localhost.localdomain:9433/"
    User time (seconds): 14.19
    System time (seconds): 7.68
    Percent of CPU this job got: 142%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.39
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 293088
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 84515
    Voluntary context switches: 82654
    Involuntary context switches: 18714
    Swaps: 0
    File system inputs: 0
    File system outputs: 112
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

So we are using 18 MB of memory vs. Java's 293 MB (maximum resident set size of 17948 kB vs. 293088 kB, about 16x less). Our execution time here is also about 19x faster (0.82 s vs. 15.39 s elapsed). I am also planning to run a test with a million directories.

> libhdfs++: Implement parallel find with wildcards tool
> ------------------------------------------------------
>
>                 Key: HDFS-10679
>                 URL: https://issues.apache.org/jira/browse/HDFS-10679
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Anatoli Shein
>            Assignee: Anatoli Shein
>         Attachments: HDFS-10679.HDFS-8707.000.patch,
> HDFS-10679.HDFS-8707.001.patch, HDFS-10679.HDFS-8707.002.patch,
> HDFS-10679.HDFS-8707.003.patch, HDFS-10679.HDFS-8707.004.patch,
> HDFS-10679.HDFS-8707.005.patch, HDFS-10679.HDFS-8707.006.patch,
> HDFS-10679.HDFS-8707.007.patch, HDFS-10679.HDFS-8707.008.patch,
> HDFS-10679.HDFS-8707.009.patch
>
>
> The find tool will issue the GetListing namenode operation on a given
> directory, and filter the results using posix globbing library.
> If the recursive option is selected, for each returned entry that is a
> directory the tool will issue another asynchronous call GetListing and repeat
> the result processing in a recursive fashion.
> One implementation issue that needs to be addressed is the way how results
> are returned back to the user: we can either buffer the results and return
> them to the user in bulk, or we can return results continuously as they
> arrive. While buffering would be an easier solution, returning results as
> they arrive would be more beneficial to the user in terms of performance,
> since the result processing can start as soon as the first results arrive
> without any delay. In order to do that we need the user to use a loop to
> process arriving results, and we need to send a special message back to the
> user when the search is over.
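
For readers who want to see the "return results as they arrive" design from the description in concrete form, here is a minimal sketch. It is not the libhdfs++ API or the code in the attached patches. The names Entry, Namespace, ResultQueue, Walk and FindAll are all hypothetical, an in-memory map stands in for the GetListing namenode call, and a single background thread replaces the parallel asynchronous calls. The only real library call is POSIX fnmatch(), used for the glob filtering. An empty std::optional plays the role of the special "search is over" message.

```cpp
#include <fnmatch.h>  // POSIX glob-style pattern matching

#include <condition_variable>
#include <map>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one entry returned by a listing call.
struct Entry {
  std::string path;
  bool is_dir;
};

// Hypothetical in-memory namespace: directory path -> its children.
// In the real tool this data would come from GetListing on the namenode.
using Namespace = std::map<std::string, std::vector<Entry>>;

// Thread-safe queue of results; std::nullopt means "search is over".
class ResultQueue {
 public:
  void Push(std::optional<std::string> item) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      items_.push(std::move(item));
    }
    cv_.notify_one();
  }

  // Blocks until an item is available, then returns it.
  std::optional<std::string> Pop() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !items_.empty(); });
    std::optional<std::string> item = std::move(items_.front());
    items_.pop();
    return item;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::optional<std::string>> items_;
};

// Lists `dir`, emits every entry whose base name matches `pattern`,
// and recurses into each entry that is a directory.
void Walk(const Namespace& ns, const std::string& dir,
          const std::string& pattern, ResultQueue& out) {
  auto it = ns.find(dir);
  if (it == ns.end()) return;
  for (const Entry& e : it->second) {
    const std::string name = e.path.substr(e.path.rfind('/') + 1);
    if (fnmatch(pattern.c_str(), name.c_str(), 0) == 0) out.Push(e.path);
    if (e.is_dir) Walk(ns, e.path, pattern, out);
  }
}

// Runs the search on a background thread while the caller consumes
// results in a loop until the end-of-search message arrives.
std::vector<std::string> FindAll(const Namespace& ns, const std::string& root,
                                 const std::string& pattern) {
  ResultQueue queue;
  std::thread producer([&] {
    Walk(ns, root, pattern, queue);
    queue.Push(std::nullopt);  // special "search is over" message
  });

  std::vector<std::string> found;
  while (auto item = queue.Pop()) {  // user-side processing loop
    found.push_back(*item);          // a real caller could act on it here
  }
  producer.join();
  return found;
}
```

In this sketch the consumer loop can start handling the first match while the walk is still in progress, which is the benefit the description attributes to streaming over buffering. FindAll collects the matches into a vector only to keep the example easy to check; a real caller would process each item inside the loop instead.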