Bryan Beaudreault created HBASE-27768:
-----------------------------------------
Summary: Race conditions in BlockingRpcConnection
Key: HBASE-27768
URL: https://issues.apache.org/jira/browse/HBASE-27768
Project: HBase
Issue Type: Bug
Reporter: Bryan Beaudreault
We've been experiencing strange timeouts since upgrading to the HBase 2
client. We use BlockingRpcConnection for now, until we migrate our auth stack
to native TLS. While diagnosing the timeouts, I noticed a few issues in this
class:
# Most importantly, there is a race condition that can leave a single
BlockingRpcConnection instance with 2 reader threads running. Both threads
then compete for the same socket, which causes unexplained timeouts and, in
some cases, corrupted responses (e.g. InvalidProtocolBufferException).
# The waitForWork loop does not properly handle interruption. When it gets
interrupted in a case where the above race condition has occurred, the
waitForWork loop ends up spinning forever: the "wait()" call immediately
throws InterruptedException, we set the interrupted state back, and the loop
restarts. So no waiting happens anymore.
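
To illustrate the second point, here is a minimal, self-contained sketch of
the spin. It is not the BlockingRpcConnection code: the class name
InterruptSpinDemo, the method spinCount, and the lock object are hypothetical,
and the loop is bounded so the demo terminates. It relies only on the
documented behavior of Object.wait(), which throws InterruptedException right
away if the thread's interrupt flag is already set.

```java
// Hypothetical demo, not HBase code: shows why restoring the interrupt flag
// and then looping back into wait() never blocks again.
class InterruptSpinDemo {

    /**
     * Runs a bounded "wait loop" with the interrupt flag set and returns how
     * many times wait() threw instead of blocking.
     */
    static int spinCount(int maxIterations) {
        Object lock = new Object();
        int immediateThrows = 0;

        // Simulate an interrupt arriving before the loop waits.
        Thread.currentThread().interrupt();

        for (int i = 0; i < maxIterations; i++) {
            synchronized (lock) {
                try {
                    // Would block for up to 1 second if the flag were clear.
                    lock.wait(1000);
                } catch (InterruptedException e) {
                    immediateThrows++;
                    // Restoring the flag and continuing the loop means the
                    // next wait() call throws immediately as well.
                    Thread.currentThread().interrupt();
                }
            }
        }

        // Clear the flag so the calling thread is left in a clean state.
        Thread.interrupted();
        return immediateThrows;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int spins = spinCount(1000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(spins + " iterations without blocking, "
            + elapsedMs + " ms elapsed");
    }
}
```

Every iteration takes the catch branch, so the loop never actually waits. A
loop that restores the flag generally also needs to exit (or otherwise stop
calling wait()) once it has been interrupted.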
The race condition is fairly rare and only occurs in certain failure
scenarios on our highest-volume clients. But once it happens, the affected
server connection keeps throwing a low level of errors until the client is
bounced.
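
For context, one generic way to guarantee at most one reader per connection is
to gate reader startup on an atomic compare-and-set. The sketch below is an
assumption for illustration only: it does not reflect how
BlockingRpcConnection starts its reader or how this issue should be fixed, and
the names SingleReaderGuard, tryStartReader, readerExited, and race are all
hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not HBase code: only the caller that wins the
// compare-and-set is allowed to become the reader.
class SingleReaderGuard {
    private final AtomicBoolean readerRunning = new AtomicBoolean(false);
    private final AtomicInteger readersStarted = new AtomicInteger();

    /** Returns true only for the one caller allowed to start a reader. */
    boolean tryStartReader() {
        if (!readerRunning.compareAndSet(false, true)) {
            return false; // a reader is already running on this connection
        }
        readersStarted.incrementAndGet();
        return true;
    }

    /** Called by the reader when it exits, so a new one may start later. */
    void readerExited() {
        readerRunning.set(false);
    }

    /** Lets several threads race to start a reader; returns how many won. */
    static int race(int callers) throws InterruptedException {
        SingleReaderGuard guard = new SingleReaderGuard();
        CountDownLatch go = new CountDownLatch(1);
        Thread[] threads = new Thread[callers];
        for (int i = 0; i < callers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    go.await(); // release all callers at the same time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                guard.tryStartReader();
            });
            threads[i].start();
        }
        go.countDown();
        for (Thread t : threads) {
            t.join();
        }
        return guard.readersStarted.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("readers started: " + race(16));
    }
}
```

Because the check and the state change happen in one atomic step, two callers
cannot both observe "no reader running" and both start one.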