[
https://issues.apache.org/jira/browse/HBASE-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617495#comment-15617495
]
stack commented on HBASE-16960:
-------------------------------
I looked at the patch.
I was concerned that it was cancelling already-running syncs, but it does not
seem to do that. We do not want to stop currently running syncs: they were
started before this failed append. If they succeed, there is no data loss. A
few handlers are going to get IOExceptions, but everything up to the failed
append will have been synced. If they do not succeed, there could be data loss,
but the syncrunner should be screaming to kill the RegionServer so that it
will replay the logs.
bq. But the problem in this JIRA is some case that there's no further syncs
after append fails, and causing an isolated sync then infinite wait.
Yes. We seem to keep turning up corner cases that can bring about this stuck
state. It is a weakness of the implementation that every append must be
followed by a sync, else the machinery gets stuck. [~aoxiang] suggests a
timeout. A long timeout that takes a look around to see what the state of
things is, and rethrows an abort if appropriate, is something I wanted to
avoid, but it seems sensible now that this is the second or third lockup
caught out in the wild.
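To make the timeout idea concrete, here is a rough sketch of a bounded wait
that wakes up periodically, looks at the server state, and throws instead of
blocking forever. It is not taken from the patch or from the HBase code:
TimedSyncWait, SketchSyncFuture, the aborting flag, and both interval
parameters are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class TimedSyncWait {

    /** Hypothetical stand-in for a sync future; NOT the real HBase SyncFuture. */
    static final class SketchSyncFuture {
        private final CountDownLatch done = new CountDownLatch(1);

        /** Called by whoever completes the sync. */
        void markDone() {
            done.countDown();
        }

        /** Returns true if the sync completed within the given time. */
        boolean await(long timeout, TimeUnit unit) throws InterruptedException {
            return done.await(timeout, unit);
        }
    }

    /**
     * Waits for the sync in short slices. After each slice that ends without
     * completion, checks whether the server is aborting (assumed to be exposed
     * as a flag) or whether the overall deadline has passed, and throws in
     * either case rather than waiting indefinitely.
     */
    static void waitForSync(SketchSyncFuture future, AtomicBoolean aborting,
            long checkIntervalMs, long maxWaitMs)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
        while (true) {
            if (future.await(checkIntervalMs, TimeUnit.MILLISECONDS)) {
                return; // sync completed normally
            }
            if (aborting.get()) {
                throw new IOException("Server is aborting; sync will not complete");
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IOException("Timed out waiting for sync");
            }
        }
    }
}
```

A handler would call waitForSync(future, aborting, 1000, 300000) in place of
an unbounded wait. The values are placeholders; the real interval and limit
would need tuning so that slow but healthy syncs are not failed spuriously.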
Thanks lads for digging in on this tough one.
> RegionServer hang when aborting
> -------------------------------
>
> Key: HBASE-16960
> URL: https://issues.apache.org/jira/browse/HBASE-16960
> Project: HBase
> Issue Type: Bug
> Reporter: binlijin
> Assignee: binlijin
> Attachments: HBASE-16960.patch, RingBufferEventHandler.png,
> RingBufferEventHandler_exception.png, SyncFuture.png,
> SyncFuture_exception.png, rs1081.jstack
>
>
> We have seen the regionserver hang when aborting several times. This takes
> all regions on this regionserver out of service, and all affected
> applications then stop working.