[ https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187755#comment-15187755 ]
Colin Patrick McCabe commented on HDFS-9924:
--------------------------------------------
Currently the NameNode can handle between 10k and 100k operations per second,
depending on configuration and the nature of the operations. It seems like you
should be able to comfortably dispatch that many operations from a few thousand
client threads performing synchronous RPC calls... bearing in mind that each
operation will take a few milliseconds on average. This is assuming that you
want to consume all the available NN RPC bandwidth from a single client node.
Perhaps I'm missing something, but I don't see how async operations will
improve performance here. The overhead of a few thousand threads on the client
is small, and certainly not what is limiting HDFS performance. Rather,
performance is limited by considerations like locking on the NameNode, Java
garbage collection on the NameNode, and serialization/deserialization
overhead.
Please keep in mind that you don't need async operations to reuse connections
and sockets... we do that already via mechanisms like the {{PeerCache}}
(formerly {{SocketCache}}). Clearly, Hive can also dispatch operations in
parallel using standard mechanisms like an Executor or ThreadPool. I certainly
don't object to implementing this, but if the goal is better performance, I
think you are going to be disappointed. Perhaps I have missed something,
though... I'm curious if there are reasons for implementing this that I have
not considered.
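As a rough illustration of the "standard mechanisms" mentioned above, here is a minimal sketch of dispatching blocking calls in parallel from a fixed-size thread pool. The names {{ParallelSyncCalls}}, {{blockingCall}} and {{runAll}} are hypothetical, and {{blockingCall}} only simulates a synchronous RPC with a short sleep; it is not an HDFS API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSyncCalls {
    // Hypothetical stand-in for a blocking RPC that takes a few milliseconds.
    static String blockingCall(int id) throws InterruptedException {
        Thread.sleep(2);
        return "result-" + id;
    }

    // Runs numCalls blocking calls on numThreads worker threads and
    // returns the results in submission order.
    static List<String> runAll(int numCalls, int numThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (int i = 0; i < numCalls; i++) {
                final int id = i;
                tasks.add(() -> blockingCall(id));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every task has completed and returns
            // the futures in the same order as the task list.
            for (Future<String> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> results = runAll(1000, 64);
        System.out.println(results.size() + " calls completed");
    }
}
```

Each worker thread still blocks for the duration of its call, so the number of calls in flight is bounded by the pool size. That is the trade-off under discussion in this issue.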
> [umbrella] Asynchronous HDFS Access
> -----------------------------------
>
> Key: HDFS-9924
> URL: https://issues.apache.org/jira/browse/HDFS-9924
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: fs
> Reporter: Tsz Wo Nicholas Sze
> Assignee: Xiaobing Zhou
>
> This is an umbrella JIRA for supporting Asynchronous HDFS Access.
> Currently, all the API methods are blocking calls -- the caller is blocked
> until the method returns. It is very slow if a client makes a large number
> of independent calls in a single thread since each call has to wait until the
> previous call is finished. It is inefficient if a client needs to create a
> large number of threads to invoke the calls.
> We propose adding a new API to support asynchronous calls, i.e. the caller is
> not blocked. The methods in the new API immediately return a Java Future
> object. The return value can be obtained by the usual Future.get() method.
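The calling pattern described above might look roughly like the following sketch. Everything here is an assumption made for illustration: {{AsyncCallSketch}}, {{blockingRename}} and {{renameAsync}} are hypothetical names, not the API proposed in this JIRA, and the sketch obtains its Future from a small thread pool, whereas a real asynchronous RPC layer would presumably avoid tying up a thread per outstanding call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncCallSketch {
    // Daemon threads so that an idle pool does not keep the JVM alive.
    private static final ExecutorService RPC_POOL =
        Executors.newFixedThreadPool(4, r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });

    // Hypothetical stand-in for an existing blocking call.
    static boolean blockingRename(String src, String dst) throws InterruptedException {
        Thread.sleep(5);
        return !src.equals(dst);
    }

    // Hypothetical non-blocking variant: returns a Future immediately.
    static Future<Boolean> renameAsync(String src, String dst) {
        return RPC_POOL.submit(() -> blockingRename(src, dst));
    }

    public static void main(String[] args) throws Exception {
        // Both calls are issued before either result is awaited.
        Future<Boolean> first = renameAsync("/a", "/b");
        Future<Boolean> second = renameAsync("/c", "/d");
        // Future.get() blocks until the corresponding call has finished.
        System.out.println(first.get() && second.get());
    }
}
```

The caller issues several independent calls from one thread and only blocks when it asks for a result with Future.get().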