[jira] [Commented] (HDFS-10679) libhdfs++: Implement parallel find with wildcards tool

James Clampffer (JIRA) Fri, 12 Aug 2016 10:10:17 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419141#comment-15419141
 ]


James Clampffer commented on HDFS-10679:
----------------------------------------

Awesome patch! Your benchmarks demonstrate a lot of the reasons this library 
was built in the first place.  Not only is it way faster than the Java client, 
it's using significant fewer resources.  Things like page faults and contexts 
switches slow down the program but also have significant externalized costs in 
the form of cache/TLB pollution, extra use of bus bandwidth, and extra IRQ 
handling that impact everything else running on the system.  Not much you can 
do if you're stuck in a java environment because cpu/memory bound things are 
never going to win there, but for people writing new code in C/C++ minimizing 
those costs is a huge win.

About the code:
1)
{code}
void FileSystemImpl::FindShim(const Status &stat, 
std::shared_ptr<std::vector<StatInfo>> stat_infos, bool directory_has_more,
    std::string path, const std::string &name, std::shared_ptr<uint64_t> 
recursion_counter,
    std::shared_ptr<std::mutex> lock, std::shared_ptr<std::vector<std::string>> 
dirs, uint32_t position,
    bool searchPath, const std::function<bool(const Status &, 
std::shared_ptr<std::vector<StatInfo>>, bool)> &handler) {
{code}
Using this many arguments for a large function makes it really hard to 
distinguish what are local variables and what was passed in.  Could you bundle 
these up into a struct/class that represents the state?  That way any time a 
developer sees some_arg_struct->lock they can infer that it was passed in as an 
argument.  The other benefit that this gives is later on, when you do lambda 
capture by \[=\] you could just explicitly bind to that struct type for the 
recursion.  This code tends to be very dense so being explicit with capture 
lists and type names e.g. avoiding "auto" in non-trivial statements can do a 
lot to improve maintainability.

2) You have a lot of good comments about the control flow.  Could you add a few 
about higher level design?


> libhdfs++: Implement parallel find with wildcards tool
> ------------------------------------------------------
>
>                 Key: HDFS-10679
>                 URL: https://issues.apache.org/jira/browse/HDFS-10679
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Anatoli Shein
>            Assignee: Anatoli Shein
>         Attachments: HDFS-10679.HDFS-8707.000.patch, 
> HDFS-10679.HDFS-8707.001.patch, HDFS-10679.HDFS-8707.002.patch, 
> HDFS-10679.HDFS-8707.003.patch, HDFS-10679.HDFS-8707.004.patch, 
> HDFS-10679.HDFS-8707.005.patch, HDFS-10679.HDFS-8707.006.patch, 
> HDFS-10679.HDFS-8707.007.patch, HDFS-10679.HDFS-8707.008.patch, 
> HDFS-10679.HDFS-8707.009.patch
>
>
> The find tool will issue the GetListing namenode operation on a given 
> directory, and filter the results using posix globbing library.
> If the recursive option is selected, for each returned entry that is a 
> directory the tool will issue another asynchronous call GetListing and repeat 
> the result processing in a recursive fashion.
> One implementation issue that needs to be addressed is the way how results 
> are returned back to the user: we can either buffer the results and return 
> them to the user in bulk, or we can return results continuously as they 
> arrive. While buffering would be an easier solution, returning results as 
> they arrive would be more beneficial to the user in terms of performance, 
> since the result processing can start as soon as the first results arrive 
> without any delay. In order to do that we need the user to use a loop to 
> process arriving results, and we need to send a special message back to the 
> user when the search is over.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Commented] (HDFS-10679) libhdfs++: Implement parallel find with wildcards tool

Reply via email to