We had some discussions at Salesforce regarding the HBase client. There are various layers of connection/thread/HTable pooling that make the client unwieldy in a setting with long-running clients, such as an application server. HTable has its own thread pool (for batch operations) and is not thread safe, and behind the scenes creating HTables will create HConnectionImplementations when necessary. HTablePool was created to sidestep some of these performance issues. In an application server setting, though, this can lead to quite byzantine setups, with many threads created and no central authority monitoring their number (threads can be limited per HTable, the pool can be limited, etc., but it is hard to reason about globally).
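For illustration, this is roughly the pattern we end up with today (0.90/0.92-era API, from memory, so details may be slightly off):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.HTablePool;

    public class TodayPattern {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Naive per-request pattern: each HTable owns its own thread
        // pool (bounded by hbase.htable.threads.max) and may create an
        // HConnectionImplementation behind the scenes.
        HTable table = new HTable(conf, "mytable");
        try {
          // ... batch puts/gets for this request ...
        } finally {
          table.close(); // flushes pending writes for this request
        }

        // HTablePool caps the number of cached HTables per table name,
        // but each pooled HTable still carries its own threads, so the
        // global thread count remains hard to reason about.
        HTablePool pool = new HTablePool(conf, 10);
        HTableInterface t = pool.getTable("mytable");
        try {
          // ... work ...
        } finally {
          pool.putTable(t); // return it to the pool
        }
      }
    }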
The HConnection code itself seems clean, so ideally what we want to do is manage HConnections. These are what identify and connect us to a cluster, and they should drive everything: HConnectionImplementation could hold a thread pool that is shared by the various HTables, and HTables could be created by a method on HConnection. Now, I imagine this will find some opposition here... :)

So... An alternate (stop-gap) approach is to add a new constructor to HTable that allows an optional HConnection and a thread pool to be provided. If both are provided, HTable would not close the pool or the connection on close(), and would become a fairly lightweight and predictable object that we can create for the duration of a request and dump afterwards. (Even the meta regions for the tables should eventually be cached in the HConnectionImplementation.) The application would manage the HConnection and the thread pool and create HTables when needed; see the P.S. below for a rough sketch.

Long term it seems a new client is needed (maybe based on asynchbase, with an additional synchronous layer), but that is a different story.

Thoughts/comments/better ideas? What do other folks do?

-- Lars
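P.S.: To make the stop-gap concrete, here is a rough sketch of how an application server could wire this up. The three-argument HTable constructor is the *proposed* one and does not exist today; names and signatures are illustrative only.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ManagedConnectionPattern {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // One connection and one bounded thread pool per cluster, owned
        // and monitored centrally by the application server.
        HConnection connection = HConnectionManager.getConnection(conf);
        ExecutorService sharedPool = Executors.newFixedThreadPool(20);

        // Per request: a cheap, short-lived HTable that borrows the
        // shared connection and pool. NOTE: proposed constructor, does
        // not exist yet.
        HTable table = new HTable(Bytes.toBytes("mytable"), connection, sharedPool);
        try {
          // ... serve the request ...
        } finally {
          table.close(); // must NOT shut down the shared pool/connection
        }

        // On application shutdown only:
        sharedPool.shutdown();
        HConnectionManager.deleteConnection(conf, true);
      }
    }

The point is that the thread count and connection count are now decided in exactly one place, instead of falling out of per-HTable and per-pool limits.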
