[ 
https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838137#action_12838137
 ] 

Karthik Ranganathan commented on HBASE-2023:
--------------------------------------------

Kannan and I took a look at this issue and came up with yet another possibility 
in addition to the 3 JD mentioned:

Move the synchronized block inside the try/catch retry loop so that it wraps 
only the getClosestRowBefore() call. Each thread then gives up the lock before 
sleeping to retry, which lets other threads make their calls even when one 
particular region is offline. In addition, if useCache is true, we can check 
the cache and return the region right away without ever entering the 
synchronized section. So the new workflow in locateRegionInMeta() will look as 
follows:

1. If useCache is true and the region is in the cache, return the region. If 
not, we have to make a remote call.
2. for the number of retries
3.   wait for lock
4.   check cache again (someone could have filled the cache while we were 
waiting). Return if found.
5.   make the remote call
6.   release lock
7.   return on success, otherwise usual error handling/sleep, goto 2
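
The steps above could be sketched roughly as follows. This is only an 
illustration, not the actual HConnectionManager code: the class name 
RegionLocatorSketch, the MetaLookup interface (a stand-in for the remote 
getClosestRowBefore() call), the String-keyed cache, and the pauseMillis 
parameter are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RegionLocatorSketch {
    /** Hypothetical stand-in for the remote META lookup (getClosestRowBefore()). */
    public interface MetaLookup {
        String lookup(String row) throws Exception;
    }

    private final Object userRegionLock = new Object();
    // Simplified cache keyed by row; the real client caches per table and key range.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final MetaLookup meta;
    private final int numRetries;
    private final long pauseMillis;

    public RegionLocatorSketch(MetaLookup meta, int numRetries, long pauseMillis) {
        this.meta = meta;
        this.numRetries = numRetries;
        this.pauseMillis = pauseMillis;
    }

    public String locateRegionInMeta(String row, boolean useCache) throws Exception {
        // Step 1: cache hit returns without ever touching the lock.
        if (useCache) {
            String cached = cache.get(row);
            if (cached != null) {
                return cached;
            }
        }
        Exception last = null;
        for (int tries = 0; tries < numRetries; tries++) {   // step 2
            try {
                synchronized (userRegionLock) {              // step 3
                    // Step 4: another thread may have filled the cache while we waited.
                    if (useCache) {
                        String cached = cache.get(row);
                        if (cached != null) {
                            return cached;
                        }
                    }
                    String location = meta.lookup(row);      // step 5
                    cache.put(row, location);
                    return location;                         // steps 6 and 7 on success
                }
            } catch (Exception e) {
                last = e;                                    // step 7: usual error handling
            }
            // The lock is already released here, so sleeping does not block other threads.
            if (tries < numRetries - 1) {
                Thread.sleep(pauseMillis);
            }
        }
        throw new Exception("Unable to locate region for row " + row
            + " after " + numRetries + " tries", last);
    }
}
```

The point of the sketch is that Thread.sleep() happens outside the 
synchronized block, so a thread retrying against an offline region never 
holds userRegionLock while it waits.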

I can work on the fix if this sounds good to you guys.


> Client sync block can cause 1 thread of a multi-threaded client to block all 
> others
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-2023
>                 URL: https://issues.apache.org/jira/browse/HBASE-2023
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: ryan rawson
>
> Take a highly multithreaded client, processing a few thousand requests a 
> second.  If a table goes offline, one thread will get stuck in 
> "locateRegionInMeta" which is located inside the following sync block:
>         synchronized(userRegionLock){
>           return locateRegionInMeta(META_TABLE_NAME, tableName, row, useCache);
>         }
> So when other threads need to find a region (EVEN IF IT'S CACHED!!!) they will 
> encounter this sync and wait. 
> This can become an issue on a busy thrift server (where I first noticed the 
> problem): one offline region can prevent access to all other regions!
> Potential solution: narrow this lock, or perhaps just get rid of it 
> completely.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
