Duo, thanks so much for the reply. This is great feedback.

1. The connections being thread-safe makes this even easier. Are there any performance concerns with a massively parallel Storm topology? Off the top of my head, we recently had a user with instances running something like 25 workers and parallelism for the associated bolt over 200. Depending on the allocation, I think that comes out to roughly 8 tasks per process. Is there an upper end of parallelism on those connections that we should look out for? We are exclusively doing GET operations on custom rowkeys that are aggressively cached, if that helps. The main concern is that once we release to our user base and they scale out, the process-to-connection association may limit how we scale with them. Should we be more concerned with the overall number of connections, or with the number of concurrent threads accessing those connections, if it matters at all?

2. We're only hitting a single HBase instance in our case. It's possible we might need to expand at some point in the future, but nothing near-term. This is all single HBase cluster config detail at this point. Any variability in the config is purely application-wide; it's all or nothing.
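For concreteness, here's roughly the pattern I understand you to be recommending, in 2.x client terms: one process-wide Connection, with cheap short-lived Tables per operation. The class name, table name, and config are placeholders for illustration, not our actual code:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class HBaseClientHolder {
  // One Connection per process: Connection is thread-safe and heavyweight,
  // so all bolt tasks in the worker share this instance.
  private static volatile Connection connection;

  public static Connection get() throws IOException {
    if (connection == null) {
      synchronized (HBaseClientHolder.class) {
        if (connection == null) {
          Configuration conf = HBaseConfiguration.create();
          connection = ConnectionFactory.createConnection(conf);
        }
      }
    }
    return connection;
  }

  // Tables are cheap, non-thread-safe, in-memory objects:
  // create, use, and close one per call rather than caching.
  public static Result doGet(byte[] rowKey) throws IOException {
    // "enrichment" is a placeholder table name
    try (Table table = get().getTable(TableName.valueOf("enrichment"))) {
      return table.get(new Get(rowKey));
    }
  }
}
```

If that's the intended shape, then the parallelism question above is really about how many threads can safely hammer `doGet` concurrently against the single shared Connection.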
Thanks,
Mike

On Mon, Aug 19, 2019 at 7:11 PM 张铎(Duo Zhang) <[email protected]> wrote:

> Generally speaking, you do not need a 'Connection Pool'; just use a single
> Connection across the whole process. And I do not think not closing the
> connection is a problem if there is only a singleton Connection across the
> whole process, as for Storm, you will only quit when the long running task
> quits (which means the process is done?)
>
> I checked the PR, I think using a TableProvider is fine. Two questions:
>
> 1. Usually you can use a single Connection as it is thread safe, and only
> use ThreadLocal to cache Tables. And now, creating a Table is a low cost
> operation; it just creates some in-memory objects. So maybe you do not
> even need to use ThreadLocal here.
> 2. Will the configuration passed to the TableProvider be different? If you
> need to go to different HBase clusters, then you need a Map to cache
> different Connections, as a single Connection can only be used to
> communicate with one HBase cluster. And if it is only some different
> timeout values, you can see the TableBuilder interface. You can specify
> them when getting a Table.
>
> Thanks.
>
> Michael Miklavcic <[email protected]> 于2019年8月20日周二 上午5:20写道:
>
> > Hi HBase dev community,
> >
> > I'm Michael Miklavcic, PMC/committer on Apache Metron, and we're heavy
> > users of Apache HBase. We're currently going through a major Hadoop
> > stack upgrade that includes an upgrade from HBase 1.1.2 to 2.0.2, and we
> > would appreciate some guidance on the new connection management
> > guarantees. The biggest risk and code change we see right now is the old
> > HTableInterface client API deprecations, where HBase connections are no
> > longer managed under the hood by HTable. The new API suggests opening a
> > long-running connection and opening/closing Tables retrieved from that
> > connection in an ad-hoc manner.
> > We currently run long-lived HBase connections in a Storm topology,
> > generally sharing those original tables on a per-thread basis. We do
> > not go to any extraordinary lengths to close any of our open HBase
> > tables; they are left open for the duration of the topology. There are
> > some close/cleanup hooks, but I don't think they are consistently
> > applied throughout the architecture. In the new API, it's unclear to me
> > what the connection retry/fail semantics will look like for instances
> > where a Table is created from a connection that is closed or has gone
> > stale.
> >
> > 1. Is there any logic built into the underlying table to refresh the
> > connections, or is it entirely up to the client to fail, create a new
> > connection, create a new table reference, and retry the operation?
> > 2. What exception/retry semantics should we expect when performing a
> > Table operation if the connection times out, other than perhaps an
> > IOException?
> > 3. How is a Table coupled to a connection under the hood in the new
> > API?
> > 4. We're looking to minimize the overall architectural impact of our
> > upgrade. I took a go at it here (
> > https://github.com/apache/metron/pull/1483/files#diff-d2799e20727b64e65da6f6ed2e95a2f0R56
> > ) by expanding on a "TableProvider" abstraction we leverage for our
> > HBase interactions. I've isolated the connection management to this
> > class on a per-thread basis for running in a long-running Storm
> > topology.
> >
> > Per #4, I'm wondering if this approach is reasonable, or whether we
> > need to seriously consider completely rewriting how we manage our
> > interactions with HBase, including a more robust connection pooling
> > solution. We want to emphasize the smallest change possible,
> > considering the overall risk of this major upgrade.
> >
> > Best,
> > Mike Miklavcic
> > PMC Apache Metron
> >
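P.S. For anyone following along later: my reading of the TableBuilder suggestion for per-Table timeout overrides is something like the sketch below. The table name and timeout values are placeholders, not recommendations:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;

public class TimeoutExample {
  // Override timeouts for one Table without touching the shared
  // Connection or the cluster-wide Configuration.
  public static Table shortTimeoutTable(Connection conn) throws IOException {
    // null pool => the connection's default batch pool;
    // "enrichment" is a placeholder table name
    return conn.getTableBuilder(TableName.valueOf("enrichment"), null)
        .setOperationTimeout(5000) // ms, placeholder value
        .setReadRpcTimeout(2000)   // ms, placeholder value
        .build();
  }
}
```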
