[jira] [Commented] (HDFS-10679) libhdfs++: Implement parallel find with wildcards tool

Anatoli Shein (JIRA) Wed, 24 Aug 2016 14:10:56 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435692#comment-15435692
 ]


Anatoli Shein commented on HDFS-10679:
--------------------------------------

Thanks for the review, [~bobhansen].

I will close this jira now since I am moving the changes from here to 
HDFS-10754.

I have addressed your comments as follows:

FS::Find:
* Async - The callback delivers results with a const std::vector<StatInfo> &, 
not a shared_ptr. This signals to the consumer to use the data delivered during 
the callback, but not to hold on to the passed-in container.
(/) Done, also fixed GetListing Async.
* Likewise, the synchronous call should take a non-const 
std::vector<StatInfo> * for an output parameter, signaling to the consumer that 
we are going to mutate their input vector
(/) Done, also fixed GetListing Sync.
* We need a very clear threading model. Will the handler be called concurrently 
from multiple threads? (Currently, yes. If we ever get on asio fibers, we should 
make it a no, because we love our consumers.)
(i) I agree. We might need to make a jira for that.
* We're doing a lot of dynamic memory allocation during recursion. Could we 
restructure things a little to not copy the entirety of the FindState and 
RecursionState on each call? It appears that they each have one element that is 
being updated for each recursive call
(/) I separated the state into CurrentState and SharedState. SharedState is 
never copied now.
* We need to hold the lock while incrementing the recursion_counter also
(i) recursion_counter is atomic and in our case increments are never paired 
with read accesses, so they do not need locking.
* If the handler returns false (don't want more) at the end of the function, do 
we do anything to prevent more from being delivered? Should we push that into 
the shared find_state and bail out for any subsequent NN responses?
(/) I added a variable "aborted" that stops recursion when the user does not 
want any more results.
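
To illustrate the SharedState/CurrentState split, the atomic recursion_counter 
and the "aborted" flag described above, here is a minimal sketch. It is not the 
code from the patch: the field layout, FakeTree and FindSketch are hypothetical, 
StatInfo is a stand-in with only two members, and the walk is synchronous over 
an in-memory map instead of issuing asynchronous GetListing calls.

```cpp
#include <atomic>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's StatInfo.
struct StatInfo {
  std::string full_path;
  bool is_directory;
};

// State shared by every recursive call: allocated once and never copied.
struct SharedState {
  // Handler returns false when the consumer wants no more results.
  std::function<bool(const std::vector<StatInfo> &)> handler;
  std::atomic<int> recursion_counter{0};  // listings currently in progress
  std::atomic<bool> aborted{false};       // set once the handler declines more
};

// Per-call state: the only part that differs between recursive calls.
struct CurrentState {
  std::string path;
  int depth;
};

// Hypothetical in-memory "namenode": directory path -> its entries.
using FakeTree = std::map<std::string, std::vector<StatInfo>>;

void FindSketch(const FakeTree &tree,
                const std::shared_ptr<SharedState> &shared,
                CurrentState current) {
  if (shared->aborted.load()) return;  // bail out of any later listings
  shared->recursion_counter.fetch_add(1);

  auto it = tree.find(current.path);
  if (it != tree.end()) {
    // Results are handed over by const reference, valid during the call only.
    if (!shared->handler(it->second)) {
      shared->aborted.store(true);
    }
    for (const StatInfo &entry : it->second) {
      if (shared->aborted.load()) break;
      if (entry.is_directory) {
        FindSketch(tree, shared, CurrentState{entry.full_path, current.depth + 1});
      }
    }
  }

  shared->recursion_counter.fetch_sub(1);
}
```

Only the small CurrentState is passed by value on each call. The SharedState 
is reached through a shared_ptr, so the handler and the flags exist exactly 
once for the whole search.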
find.cpp:
* Like the cat examples, simplify as much as possible. Nuke URI parsing, etc.
(/) Done.
* Expand smth_found to something_found to prevent confusion (especially in an 
example)
(/) Done.
* We have race conditions if one thread is outputting the previous block while 
another thread gets a final block (or error).
(/) Fixed by locking the handler.
FS::GetFileInfo should populate the full_path member also
(/) Done.

> libhdfs++: Implement parallel find with wildcards tool
> ------------------------------------------------------
>
>                 Key: HDFS-10679
>                 URL: https://issues.apache.org/jira/browse/HDFS-10679
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Anatoli Shein
>            Assignee: Anatoli Shein
>         Attachments: HDFS-10679.HDFS-8707.000.patch, 
> HDFS-10679.HDFS-8707.001.patch, HDFS-10679.HDFS-8707.002.patch, 
> HDFS-10679.HDFS-8707.003.patch, HDFS-10679.HDFS-8707.004.patch, 
> HDFS-10679.HDFS-8707.005.patch, HDFS-10679.HDFS-8707.006.patch, 
> HDFS-10679.HDFS-8707.007.patch, HDFS-10679.HDFS-8707.008.patch, 
> HDFS-10679.HDFS-8707.009.patch, HDFS-10679.HDFS-8707.010.patch, 
> HDFS-10679.HDFS-8707.011.patch, HDFS-10679.HDFS-8707.012.patch, 
> HDFS-10679.HDFS-8707.013.patch
>
>
> The find tool will issue the GetListing namenode operation on a given 
> directory, and filter the results using a POSIX globbing library.
> If the recursive option is selected, for each returned entry that is a 
> directory the tool will issue another asynchronous call GetListing and repeat 
> the result processing in a recursive fashion.
> One implementation issue that needs to be addressed is how results are 
> returned to the user: we can either buffer the results and return 
> them to the user in bulk, or we can return results continuously as they 
> arrive. While buffering would be an easier solution, returning results as 
> they arrive would be more beneficial to the user in terms of performance, 
> since the result processing can start as soon as the first results arrive 
> without any delay. In order to do that we need the user to use a loop to 
> process arriving results, and we need to send a special message back to the 
> user when the search is over.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
