[ 
https://issues.apache.org/jira/browse/HBASE-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15619357#comment-15619357
 ] 

stack commented on HBASE-16960:
-------------------------------

I can't say for sure that the patch will fix the problem, so I think we should 
also put a timeout on the long wait on sync and call abort if we time out, just 
in case.
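
To make the suggestion concrete, here is a rough sketch of a bounded wait that 
aborts on timeout. It is not the HBase code: the Abortable interface, the 
waitForSync name and the use of CompletableFuture in place of SyncFuture are all 
assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedSyncWait {
    /** Hypothetical stand-in for the server's abort hook. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /**
     * Waits up to timeoutMs for the sync to finish. Returns true if it
     * completed; on timeout, calls abort and returns false.
     */
    static boolean waitForSync(CompletableFuture<Long> sync, long timeoutMs, Abortable server) {
        try {
            sync.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Nobody completed the sync in time: give up and ask for an abort.
            server.abort("Timed out after " + timeoutMs + " ms waiting on sync", e);
            return false;
        } catch (ExecutionException e) {
            // The sync itself failed; surface the underlying cause.
            throw new IllegalStateException("sync failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting on sync", e);
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Long> stuck = new CompletableFuture<>(); // never completed
        boolean ok = waitForSync(stuck, 100, (why, cause) -> System.out.println("ABORT: " + why));
        System.out.println("sync completed: " + ok);
    }
}
```

The point of the timeout is only that a waiter which is never completed turns 
into an abort instead of a hang; the right timeout value is a separate question.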

An append throws an exception (this should rarely happen). We set the exception 
in onEvent so all subsequent appends get this exception, but we keep pulling 
from the ringbuffer to clear it out.
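
The behaviour described above can be sketched as follows. This is a simplified 
model, not the real handler: the Append class, its poison flag and the written 
list are hypothetical, and a plain array stands in for the ringbuffer.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class StickyAppendFailure {
    /** Hypothetical append event; poison simulates a write that throws. */
    static final class Append {
        final String payload;
        final boolean poison;
        IOException failure; // non-null once the append has been rejected

        Append(String payload, boolean poison) {
            this.payload = payload;
            this.poison = poison;
        }
    }

    private IOException sticky;                     // first append failure, if any
    final List<String> written = new ArrayList<>();

    /** Called once per event pulled from the buffer. */
    void onEvent(Append a) {
        if (sticky != null) {
            a.failure = sticky;                     // fail fast, but keep consuming events
            return;
        }
        try {
            write(a);
        } catch (IOException e) {
            sticky = e;                             // every later append sees this
            a.failure = e;
        }
    }

    private void write(Append a) throws IOException {
        if (a.poison) {
            throw new IOException("simulated append failure");
        }
        written.add(a.payload);
    }

    public static void main(String[] args) {
        StickyAppendFailure handler = new StickyAppendFailure();
        Append[] events = { new Append("a", false), new Append("b", true), new Append("c", false) };
        for (Append e : events) {
            handler.onEvent(e);
        }
        System.out.println("written=" + handler.written + ", c failed=" + (events[2].failure != null));
    }
}
```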

We schedule a roll of the log. The roll fails because many (8k? Is that 
possible?) appends have gone into the log and they have not been ACK'd with a 
sync so we will fail the roll and call for an ABORT of the server to replay 
logs.

Now, I can't tell for sure what state we are in. Batching in the RingBuffer is 
basic. It is just whatever is there since the last time we went to pull from 
the ringbuffer. A batch would have to have been something like append, append, 
sync, sync, append... i.e. an append came in after some syncs, which is 
possible of course. In this case, I think your patch will help clear out the 
unoffered syncrunners, i.e. the syncs that came in before the append that failed. 
If no new sync comes around the ringbuffer, these are just going to hang out. 
It looks like we are so busy trying to ABORT, we neglect to schedule these 
SyncFutures.
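
As an illustration of the cleanup being discussed (not the attached patch, whose 
details are not reproduced here), the sketch below parks syncs as they arrive 
and releases them with the exception when an append fails, so no waiter is left 
hanging. The class and method names are made up, and CompletableFuture stands in 
for SyncFuture.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class PendingSyncCleanup {
    // Syncs seen in the current batch that have not been handed off yet.
    private final List<CompletableFuture<Void>> pendingSyncs = new ArrayList<>();
    private IOException sticky; // first append failure, if any

    /** A sync arrives from the buffer: park it until the batch is handed off. */
    void onSync(CompletableFuture<Void> sync) {
        if (sticky != null) {
            sync.completeExceptionally(sticky); // already broken: fail right away
            return;
        }
        pendingSyncs.add(sync);
    }

    /** An append fails: remember why and release every parked sync. */
    void onAppendFailure(IOException e) {
        sticky = e;
        for (CompletableFuture<Void> sync : pendingSyncs) {
            sync.completeExceptionally(e);      // wake the waiter with the failure
        }
        pendingSyncs.clear();
    }

    int pendingCount() {
        return pendingSyncs.size();
    }

    public static void main(String[] args) {
        PendingSyncCleanup handler = new PendingSyncCleanup();
        CompletableFuture<Void> s1 = new CompletableFuture<>();
        handler.onSync(s1);
        handler.onAppendFailure(new IOException("simulated append failure"));
        System.out.println("s1 released: " + s1.isCompletedExceptionally());
    }
}
```

Without the release step in onAppendFailure, s1 would only ever be completed if 
another sync came around the buffer, which is the hang described above.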

Can you reproduce?

Thanks for digging in on this one [~carp84] and [~aoxiang]

> RegionServer hang when aborting
> -------------------------------
>
>                 Key: HBASE-16960
>                 URL: https://issues.apache.org/jira/browse/HBASE-16960
>             Project: HBase
>          Issue Type: Bug
>            Reporter: binlijin
>            Assignee: binlijin
>         Attachments: HBASE-16960.patch, RingBufferEventHandler.png, 
> RingBufferEventHandler_exception.png, SyncFuture.png, 
> SyncFuture_exception.png, rs1081.jstack
>
>
> Several times we have seen a regionserver hang when aborting, which takes all 
> regions on this regionserver out of service, and then all affected 
> applications stop working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
