[
https://issues.apache.org/jira/browse/HBASE-27768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17710278#comment-17710278
]
Bryan Beaudreault commented on HBASE-27768:
-------------------------------------------
We were never able to reliably reproduce the issue in a test environment, but
this patch resolved the issue for us in production. Previously we were hitting
the above issue every 1-3 days. It's now been about a week with the fix
deployed and we haven't hit it.
> Race conditions in BlockingRpcConnection
> ----------------------------------------
>
> Key: HBASE-27768
> URL: https://issues.apache.org/jira/browse/HBASE-27768
> Project: HBase
> Issue Type: Bug
> Reporter: Bryan Beaudreault
> Assignee: Bryan Beaudreault
> Priority: Major
> Labels: patch-available
> Fix For: 2.6.0, 2.5.5, 2.4.18
>
>
> We've been experiencing strange timeouts since upgrading to hbase2 client. We
> use BlockingRpcConnection for now until we migrate our auth stack to native
> TLS. In diagnosing the timeouts, I noticed a few issues in this class:
> # Most importantly, there is a race condition which can result in a case
> where a BlockingRpcConnection instance has 2 reader threads running. In this
> case, both compete for the socket, which causes weird timeouts and, in some
> cases, corrupted responses (e.g. InvalidProtocolBufferException).
> # The waitForWork loop does not properly handle interruption. When it gets
> interrupted, if the above race condition occurs, the waitForWork loop ends up
> stuck in a tight loop forever. The wait() call instantly throws
> InterruptedException, we set the interrupted state back, and we restart the
> loop, so no waiting occurs anymore.
> The race condition is somewhat rare, only occurring in certain failure
> scenarios on our highest volume clients. But when it happens, a low level of
> errors is thrown indefinitely for the affected server connection until the
> client is bounced.
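The tight-loop behavior described in item 2 can be sketched in isolation. This is a minimal, hypothetical reproduction, not HBase's actual BlockingRpcConnection code: once the thread's interrupt flag is set, Object.wait() throws InterruptedException immediately (clearing the flag), and a catch block that restores the flag before retrying guarantees the next wait() fails the same way, so the loop spins without ever blocking.

```java
public class InterruptedWaitLoop {
    private final Object lock = new Object();

    // Simplified wait loop in the style of a waitForWork method. It
    // restores the interrupt flag after each InterruptedException and
    // retries, so while interrupted it never actually waits.
    int spinCount(int maxIterations) {
        int spins = 0;
        synchronized (lock) {
            while (spins < maxIterations) {
                try {
                    // Throws instantly while the interrupt flag is set,
                    // clearing the flag as it throws.
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // re-set the flag...
                    spins++; // ...so the next wait() fails again: a tight loop
                }
            }
        }
        return spins;
    }

    public static void main(String[] args) {
        InterruptedWaitLoop demo = new InterruptedWaitLoop();
        Thread.currentThread().interrupt(); // simulate the interruption
        long start = System.nanoTime();
        int spins = demo.spinCount(10_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        Thread.interrupted(); // clear the flag before doing anything else
        // Thousands of "waits" complete almost instantly: no real waiting
        System.out.println(spins + " iterations in " + elapsedMs + " ms");
    }
}
```

Each iteration burns CPU instead of parking the thread, which matches the "no waiting is occurring anymore" symptom; a fix needs to exit the loop (or otherwise stop retrying) on interruption rather than swallowing it and re-entering wait().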
--
This message was sent by Atlassian Jira
(v8.20.10#820010)