[
https://issues.apache.org/jira/browse/HBASE-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15619357#comment-15619357
]
stack commented on HBASE-16960:
-------------------------------
I can't say for sure that the patch will fix the problem, so I think we should
also add the long, bounded wait on sync and call abort if it times out, just in
case.
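A minimal sketch of that just-in-case guard, in plain Java. It is not the
HBase code: the Abortable interface, the waitForSync helper and the 50 ms
timeout are hypothetical names and values chosen for illustration, and a
CompletableFuture stands in for the real sync future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class BoundedSyncWait {
    /** Hypothetical stand-in for the server's abort hook. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /**
     * Waits for a sync to complete, but never longer than timeoutMs.
     * Returns true if the sync completed in time. On timeout it calls
     * abort and returns false instead of blocking forever.
     */
    static boolean waitForSync(CompletableFuture<Long> syncFuture,
                               long timeoutMs, Abortable server) {
        try {
            syncFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            server.abort("Timed out after " + timeoutMs + " ms waiting on sync", e);
            return false;
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller and give up waiting.
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            // The sync itself failed; surface the cause unchecked.
            throw new IllegalStateException("sync failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        AtomicBoolean aborted = new AtomicBoolean(false);
        Abortable server = (why, cause) -> aborted.set(true);

        // A future that nobody ever completes models the hang.
        CompletableFuture<Long> stuck = new CompletableFuture<>();
        boolean completed = waitForSync(stuck, 50, server);
        System.out.println("completed=" + completed + " aborted=" + aborted.get());
    }
}
```

The point of the design is only that the waiter has an upper bound: even if
nothing ever completes the future, the thread gets control back and can
escalate to abort.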
An append throws an exception (this usually never happens). We set the
exception in onEvent so that all subsequent appends get this exception, but we
keep pulling on the ringbuffer to clear it out.
We schedule a roll of the log. The roll fails because many (8k? Is that
possible?) appends have gone into the log without being ACK'd by a sync, so we
fail the roll and call for an ABORT of the server so that the logs get
replayed.
Now, I can't tell for sure what state we are in. Batching in the RingBuffer is
basic: a batch is just whatever has arrived since the last time we pulled from
the ringbuffer. The batch would have to have been something like append, append,
sync, sync, append..., i.e. an append came in after some syncs, which is
possible of course. In this case, I think your patch will help by clearing out
unoffered syncrunners, that is, the syncs that came in before the append that
failed. If no new sync comes around the ringbuffer, these are just going to
hang out. It looks like we are so busy trying to ABORT that we neglect to
schedule these SyncFutures.
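The suspected state can be illustrated with a toy model. This is a sketch
under stated assumptions, not the FSHLog handler: the Event class, the
pendingSyncs list and the releasePendingOnFailure flag are all hypothetical,
and CompletableFuture stands in for SyncFuture. It only shows how syncs
collected before a failing append can be left unscheduled when the handler
returns early, and how releasing them at the point of failure avoids that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class RingBufferBatchSketch {
    /** One slot pulled from the (hypothetical) ring buffer: an append or a sync. */
    static final class Event {
        final boolean appendFails;          // only meaningful for appends
        final CompletableFuture<Void> sync; // non-null only for sync events

        private Event(boolean appendFails, CompletableFuture<Void> sync) {
            this.appendFails = appendFails;
            this.sync = sync;
        }

        static Event append(boolean fails) { return new Event(fails, null); }
        static Event sync(CompletableFuture<Void> f) { return new Event(false, f); }
    }

    private final List<CompletableFuture<Void>> pendingSyncs = new ArrayList<>();
    private final boolean releasePendingOnFailure;
    private Exception failure;

    RingBufferBatchSketch(boolean releasePendingOnFailure) {
        this.releasePendingOnFailure = releasePendingOnFailure;
    }

    void onEvent(Event e, boolean endOfBatch) {
        if (e.sync != null) {
            pendingSyncs.add(e.sync);
            if (failure != null) {
                failPending(); // a later sync flushes everything out with the failure
                return;
            }
        } else {
            if (failure != null) {
                return; // later appends just see the earlier failure
            }
            if (e.appendFails) {
                failure = new IllegalStateException("append failed");
                if (releasePendingOnFailure) {
                    failPending(); // release the syncs collected before the failure
                }
                return; // early return: the end-of-batch handoff below is skipped
            }
        }
        if (endOfBatch) {
            // Normal path: hand off (here: simply complete) the collected syncs.
            for (CompletableFuture<Void> f : pendingSyncs) {
                f.complete(null);
            }
            pendingSyncs.clear();
        }
    }

    private void failPending() {
        for (CompletableFuture<Void> f : pendingSyncs) {
            f.completeExceptionally(failure);
        }
        pendingSyncs.clear();
    }

    /** Replays the batch "sync, sync, failing append" and reports whether a sync is left hanging. */
    static boolean leavesSyncHanging(boolean releasePendingOnFailure) {
        RingBufferBatchSketch handler = new RingBufferBatchSketch(releasePendingOnFailure);
        CompletableFuture<Void> s1 = new CompletableFuture<>();
        CompletableFuture<Void> s2 = new CompletableFuture<>();
        handler.onEvent(Event.sync(s1), false);
        handler.onEvent(Event.sync(s2), false);
        handler.onEvent(Event.append(true), true);
        return !s1.isDone() || !s2.isDone();
    }

    public static void main(String[] args) {
        System.out.println("hangs without release: " + leavesSyncHanging(false));
        System.out.println("hangs with release:    " + leavesSyncHanging(true));
    }
}
```

In the model without the release, the two waiting syncs complete only if
another sync event happens to arrive afterwards, which matches the "if no new
sync comes around" condition described above.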
Can you reproduce?
Thanks for digging in on this one, [~carp84] and [~aoxiang].
> RegionServer hang when aborting
> -------------------------------
>
> Key: HBASE-16960
> URL: https://issues.apache.org/jira/browse/HBASE-16960
> Project: HBase
> Issue Type: Bug
> Reporter: binlijin
> Assignee: binlijin
> Attachments: HBASE-16960.patch, RingBufferEventHandler.png,
> RingBufferEventHandler_exception.png, SyncFuture.png,
> SyncFuture_exception.png, rs1081.jstack
>
>
> We have seen the regionserver hang while aborting several times. This takes
> all regions on that regionserver out of service, and all affected
> applications then stop working.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)