We had some discussions at Salesforce regarding the HBase client.

There are various layers of connection/thread/HTable pooling that make the 
client unwieldy for long-running clients such as an application server. HTable 
has its own thread pool (for batch operations) and is not thread safe, and 
behind the scenes creating HTables will create HConnectionImplementations as 
necessary. HTablePool was created to sidestep some of the resulting 
performance issues. In an application server setting, though, this can lead to 
quite byzantine setups, with many threads created and no central authority 
monitoring their number (threads can be limited per HTable, the pool size can 
be limited, etc., but the totals are hard to reason about globally).
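
For concreteness, here is a minimal sketch of the status quo (API names as of 
roughly 0.90/0.92; exact signatures vary between versions). The pool caps the 
number of pooled HTables, but each HTable still carries its own internal 
thread pool, so no single knob bounds the threads:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTableInterface;
  import org.apache.hadoop.hbase.client.HTablePool;

  public class StatusQuo {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // caps the number of pooled HTables per table name, not threads
      HTablePool pool = new HTablePool(conf, 100);
      HTableInterface table = pool.getTable("mytable");
      try {
        // ... per-request gets/puts ...
      } finally {
        // returns the HTable to the pool; its internal threads linger
        pool.putTable(table);
      }
    }
  }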


The HConnection code itself seems clean, so ideally what we want is to manage 
HConnections directly. These are what identify and connect us to a cluster, 
and they should drive everything.

HConnectionImplementation could own the thread pool shared by the various 
HTables, and HTables could be created via a method on HConnection.
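
A rough sketch of the shape that would take (the getTable() method on 
HConnection is the proposal here, not an existing API; 
HConnectionManager.getConnection() is the existing lookup):

  Configuration conf = HBaseConfiguration.create();
  HConnection conn = HConnectionManager.getConnection(conf);
  // proposed: the connection owns the shared thread pool and vends tables
  HTableInterface table = conn.getTable("mytable");
  try {
      // ... per-request gets/puts ...
  } finally {
      table.close(); // cheap; conn keeps its pool and its meta cache
  }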

Now, I imagine this will find some opposition here... :)

So... An alternate (stop-gap) approach is to add a new constructor to HTable 
that optionally takes an HConnection and a thread pool. If provided, HTable 
would not close the pool or the connection on close, and would become a fairly 
lightweight and predictable object that we can create for the duration of a 
request and discard afterwards. (Even the meta region locations for the tables 
should eventually be cached in the HConnectionImplementation.)
The application would manage the HConnection and the ThreadPool and create 
HTables when needed.
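
To make that concrete, a hedged sketch, assuming the proposed three-argument 
HTable constructor (everything else is existing API):

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HConnection;
  import org.apache.hadoop.hbase.client.HConnectionManager;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class StopGap {
    public static void main(String[] args) throws Exception {
      // managed once, by the application:
      Configuration conf = HBaseConfiguration.create();
      ExecutorService sharedPool = Executors.newFixedThreadPool(16); // one global cap
      HConnection conn = HConnectionManager.getConnection(conf);

      // per request:
      HTable table = new HTable(Bytes.toBytes("mytable"), conn, sharedPool); // proposed ctor
      try {
        Result r = table.get(new Get(Bytes.toBytes("row1")));
      } finally {
        table.close(); // lightweight; leaves conn and sharedPool untouched
      }
    }
  }

With this, thread usage is bounded in exactly one place (sharedPool), and the 
per-request HTable carries no hidden resources of its own.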


Long term, it seems a new client is needed (maybe based on asynchbase, with an 
additional synchronous layer on top), but that is a different story.


Thoughts/comments/better ideas? What do other folks do?


-- Lars
