[jira] [Commented] (HDFS-10679) libhdfs++: Implement parallel find with wildcards tool

Anatoli Shein (JIRA) Thu, 11 Aug 2016 10:49:54 -0700

    [ https://issues.apache.org/jira/browse/HDFS-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15417660#comment-15417660 ]


Anatoli Shein commented on HDFS-10679:
--------------------------------------

Also, I just ran a test using "/usr/bin/time -v" to measure memory consumption 
on the following directory structure:
root dir / 10 dirs / 1500 dirs each
(15011 directories total)

Our output:
               Command being timed: "find hdfs://localhost.localdomain:9433/ * 1"
               User time (seconds): 0.33
               System time (seconds): 0.11
               Percent of CPU this job got: 54%
               Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.82
               Average shared text size (kbytes): 0
               Average unshared data size (kbytes): 0
               Average stack size (kbytes): 0
               Average total size (kbytes): 0
               Maximum resident set size (kbytes): 17948
               Average resident set size (kbytes): 0
               Major (requiring I/O) page faults: 0
               Minor (reclaiming a frame) page faults: 2835
               Voluntary context switches: 4297
               Involuntary context switches: 27
               Swaps: 0
               File system inputs: 0
               File system outputs: 0
               Socket messages sent: 0
               Socket messages received: 0
               Signals delivered: 0
               Page size (bytes): 4096
               Exit status: 0

Java Hadoop output:
               Command being timed: "hadoop fs -ls -R hdfs://localhost.localdomain:9433/"
               User time (seconds): 14.19
               System time (seconds): 7.68
               Percent of CPU this job got: 142%
               Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.39
               Average shared text size (kbytes): 0
               Average unshared data size (kbytes): 0
               Average stack size (kbytes): 0
               Average total size (kbytes): 0
               Maximum resident set size (kbytes): 293088
               Average resident set size (kbytes): 0
               Major (requiring I/O) page faults: 0
               Minor (reclaiming a frame) page faults: 84515
               Voluntary context switches: 82654
               Involuntary context switches: 18714
               Swaps: 0
               File system inputs: 0
               File system outputs: 112
               Socket messages sent: 0
               Socket messages received: 0
               Signals delivered: 0
               Page size (bytes): 4096
               Exit status: 0

So we are using about 18 MB of memory versus Java's 293 MB (about 16x less).
Our elapsed time here is also about 19x shorter (0.82 s versus 15.39 s).
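
As a quick check of those two ratios, here is a small C++ sketch that recomputes them from the maximum resident set size and elapsed time reported above. The function and constant names are made up for illustration and are not part of libhdfs++.

```cpp
#include <cassert>
#include <cmath>

// Ratio of a baseline measurement to ours. The inputs below are copied
// from the /usr/bin/time -v output quoted in this comment.
double ratio(double baseline, double ours) {
    assert(ours > 0.0);  // guard against division by zero
    return baseline / ours;
}

// Maximum resident set size, in kilobytes.
const double kJavaRssKb = 293088.0;
const double kOursRssKb = 17948.0;

// Elapsed wall-clock time, in seconds.
const double kJavaWallSec = 15.39;
const double kOursWallSec = 0.82;

// Memory ratio: 293088 / 17948, roughly 16.3.
double memory_ratio() { return ratio(kJavaRssKb, kOursRssKb); }

// Elapsed-time ratio: 15.39 / 0.82, roughly 18.8.
double time_ratio() { return ratio(kJavaWallSec, kOursWallSec); }
```

Rounded to whole numbers, these give the 16x and 19x figures quoted above.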

I am also planning to run a test with a million directories.

> libhdfs++: Implement parallel find with wildcards tool
> ------------------------------------------------------
>
>                 Key: HDFS-10679
>                 URL: https://issues.apache.org/jira/browse/HDFS-10679
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Anatoli Shein
>            Assignee: Anatoli Shein
>         Attachments: HDFS-10679.HDFS-8707.000.patch, 
> HDFS-10679.HDFS-8707.001.patch, HDFS-10679.HDFS-8707.002.patch, 
> HDFS-10679.HDFS-8707.003.patch, HDFS-10679.HDFS-8707.004.patch, 
> HDFS-10679.HDFS-8707.005.patch, HDFS-10679.HDFS-8707.006.patch, 
> HDFS-10679.HDFS-8707.007.patch, HDFS-10679.HDFS-8707.008.patch, 
> HDFS-10679.HDFS-8707.009.patch
>
>
> The find tool will issue the GetListing namenode operation on a given 
> directory, and filter the results using a POSIX globbing library.
> If the recursive option is selected, then for each returned entry that is a 
> directory the tool will issue another asynchronous GetListing call and 
> process its results in the same recursive fashion.
> One implementation issue that needs to be addressed is how results are 
> returned to the user: we can either buffer the results and return them in 
> bulk, or we can return results continuously as they arrive. While buffering 
> would be the easier solution, returning results as they arrive would be 
> better for the user in terms of performance, since result processing can 
> start as soon as the first results arrive. To do that, the user needs to 
> process arriving results in a loop, and we need to send a special message 
> back to the user when the search is over.
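
The streaming approach described above could be sketched roughly as follows. This is not the code from the attached patches. It is a minimal, synchronous illustration that assumes a hypothetical in-memory FakeNamespace in place of the namenode GetListing call, and hypothetical names Find, Walk and ResultHandler. The only real API used is POSIX fnmatch. Matches are handed to the caller once per directory listing, and a final call with done set to true plays the role of the "search is over" message.

```cpp
#include <fnmatch.h>

#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the namenode: maps a directory path to the
// names of its children. A child whose full path is also a key is a directory.
using FakeNamespace = std::map<std::string, std::vector<std::string>>;

// Called once per batch of matches; `done` is true only on the final call.
using ResultHandler =
    std::function<void(const std::vector<std::string>& matches, bool done)>;

namespace detail {
// Lists one directory, reports its matches, then descends if requested.
void Walk(const FakeNamespace& ns, const std::string& dir,
          const std::string& pattern, bool recursive,
          const ResultHandler& handler) {
    auto it = ns.find(dir);
    if (it == ns.end()) return;  // not a directory, nothing to list

    std::vector<std::string> matches;
    std::vector<std::string> subdirs;
    for (const std::string& name : it->second) {
        const std::string prefix = (dir == "/") ? std::string("/") : dir + "/";
        const std::string path = prefix + name;
        // fnmatch returns 0 when the name matches the glob pattern.
        if (fnmatch(pattern.c_str(), name.c_str(), 0) == 0) {
            matches.push_back(path);
        }
        if (ns.count(path) != 0) subdirs.push_back(path);
    }
    // Deliver this listing's matches before looking at subdirectories.
    if (!matches.empty()) handler(matches, false);

    if (recursive) {
        for (const std::string& sub : subdirs) {
            Walk(ns, sub, pattern, recursive, handler);
        }
    }
}
}  // namespace detail

// Streams matches to `handler`, then sends one end-of-search call.
void Find(const FakeNamespace& ns, const std::string& root,
          const std::string& pattern, bool recursive,
          const ResultHandler& handler) {
    assert(handler);  // a callable handler is required
    detail::Walk(ns, root, pattern, recursive, handler);
    handler(std::vector<std::string>{}, true);  // end-of-search marker
}
```

In the real tool the listings would arrive asynchronously and possibly out of order, so the end-of-search call would have to wait until every outstanding request has completed. The sketch sidesteps that by walking the tree sequentially.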



