[ 
https://issues.apache.org/jira/browse/HBASE-27768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17710278#comment-17710278
 ] 

Bryan Beaudreault commented on HBASE-27768:
-------------------------------------------

We were never able to reliably reproduce the issue in a test environment, but 
this patch resolved the issue for us in production. Previously we were hitting 
the above issue every 1-3 days. It's now been about a week with the fix deployed 
and we haven't hit it.

> Race conditions in BlockingRpcConnection
> ----------------------------------------
>
>                 Key: HBASE-27768
>                 URL: https://issues.apache.org/jira/browse/HBASE-27768
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>              Labels: patch-available
>             Fix For: 2.6.0, 2.5.5, 2.4.18
>
>
> We've been experiencing strange timeouts since upgrading to the hbase2 client. 
> We use BlockingRpcConnection for now until we migrate our auth stack to native 
> TLS. In diagnosing the timeouts, I noticed a few issues in this class:
>  # Most importantly, there is a race condition which can result in a 
> BlockingRpcConnection instance having 2 reader threads running. In this 
> case, both compete for the socket, which causes weird timeouts and in 
> some cases corrupted responses (e.g. InvalidProtocolBufferException).
>  # The waitForWork loop does not properly handle interruption. When it gets 
> interrupted, if the above race condition occurs, the waitForWork loop ends up 
> stuck in a tight loop forever. The "wait()" call instantly throws 
> InterruptedException, and we set the interrupted state back and restart the 
> loop, so no waiting occurs anymore.
> The race condition is somewhat rare, only occurring in certain failure 
> scenarios on our highest volume clients. But when it happens, a low level of 
> errors will forever be thrown for the affected server connection until the 
> client is bounced.
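
The tight loop described in the second item of the quoted description can be reproduced outside HBase. The sketch below is not the BlockingRpcConnection code and not the actual patch; the class and method names (WaitLoopSketch, buggyWait, fixedWait) and the maxSpins cap are hypothetical, added only so the demonstration terminates. It assumes the loop shape described above: wait() on a monitor, and on InterruptedException restore the interrupt flag. If the loop then continues, the next wait() throws immediately because the flag is still set; leaving the loop on interruption is one way to avoid that.

```java
final class WaitLoopSketch {
    private static final Object LOCK = new Object();
    // Hypothetical condition; never set here, so only the loop behavior is exercised.
    private static boolean workAvailable = false;

    /** Restores the interrupt flag and keeps looping: each later wait() throws at once. */
    static int buggyWait(int maxSpins) {
        int spins = 0;
        synchronized (LOCK) {
            while (!workAvailable && spins < maxSpins) {
                spins++;
                try {
                    LOCK.wait(10);
                } catch (InterruptedException e) {
                    // Flag is set again, so the next wait() does not block at all.
                    Thread.currentThread().interrupt();
                }
            }
        }
        return spins;
    }

    /** Restores the interrupt flag and leaves the loop, so the caller can shut down. */
    static int fixedWait(int maxSpins) {
        int spins = 0;
        synchronized (LOCK) {
            while (!workAvailable && spins < maxSpins) {
                spins++;
                try {
                    LOCK.wait(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break; // treat interruption as a request to stop waiting
                }
            }
        }
        return spins;
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt();
        System.out.println("buggy iterations: " + buggyWait(1000));
        Thread.interrupted(); // clear the flag before the second run
        Thread.currentThread().interrupt();
        System.out.println("fixed iterations: " + fixedWait(1000));
        Thread.interrupted();
    }
}
```

With the interrupt flag set before the call, buggyWait runs through its whole iteration budget without ever blocking, while fixedWait exits after a single iteration. How the real fix handles interruption (close the connection, rethrow, or return) is not stated in this message.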



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
