[jira] [Commented] (HBASE-16960) RegionServer hang when aborting

stack (JIRA) Fri, 28 Oct 2016 22:36:56 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617495#comment-15617495
 ]


stack commented on HBASE-16960:
-------------------------------

I looked at the patch.

I was concerned that it was cancelling already-running syncs, but it does not 
seem to do that. We do not want to stop currently running syncs: they were 
started before this failed append. If they succeed, there is no data loss; a 
few handlers are going to get IOExceptions, but everything up to the failed 
append will have been synced. If they do not succeed, there could be data 
loss, but the syncrunner should be screaming to kill the RegionServer so that 
the logs get replayed.

bq. But the problem in this JIRA is some case that there's no further syncs 
after append fails, and causing an isolated sync then infinite wait. 

Yes. We seem to keep turning up corner cases that can bring about this stuck 
state. It is a weakness of the implementation that every append must be 
followed by a sync, else the machinery gets stuck. [~aoxiang] suggests a 
timeout. A long timeout that takes a look around to see what the state of 
things is, and rethrows an abort if appropriate, is something I wanted to 
avoid, but it seems sensible now that this is the second or third lockup of 
this kind caught out in the wild.
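
To make the timeout idea concrete, here is a rough sketch of a waiter that wakes up periodically, looks at an abort flag, and throws instead of blocking forever. This is not the patch and not HBase's actual SyncFuture; the class name TimedSyncWait, the aborting flag, and the checkIntervalMs parameter are all made up for illustration.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a WAL sync future; NOT HBase's real SyncFuture. */
public class TimedSyncWait {
    private boolean done;
    private Throwable failure;

    /** Called by the (assumed) sync machinery when the sync finishes or fails. */
    public synchronized void complete(Throwable t) {
        done = true;
        failure = t;
        notifyAll();
    }

    /**
     * Blocks until the sync completes, but wakes every checkIntervalMs to
     * "take a look around". If the server is aborting and nothing will ever
     * complete this sync, throw rather than wait forever.
     */
    public synchronized void await(AtomicBoolean aborting, long checkIntervalMs)
            throws IOException, InterruptedException {
        if (checkIntervalMs <= 0) {
            // wait(0) would block indefinitely, which is the bug we are avoiding.
            throw new IllegalArgumentException("checkIntervalMs must be > 0");
        }
        while (!done) {
            if (aborting.get()) {
                throw new IOException("aborting; no further sync will complete this wait");
            }
            wait(checkIntervalMs); // returns on notifyAll() or after the interval
        }
        if (failure != null) {
            throw new IOException("sync failed", failure);
        }
    }
}
```

The interval only bounds how long a stranded handler stays stuck; a sync that completes normally still wakes the waiter immediately through notifyAll().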

Thanks lads for digging in on this tough one.





> RegionServer hang when aborting
> -------------------------------
>
>                 Key: HBASE-16960
>                 URL: https://issues.apache.org/jira/browse/HBASE-16960
>             Project: HBase
>          Issue Type: Bug
>            Reporter: binlijin
>            Assignee: binlijin
>         Attachments: HBASE-16960.patch, RingBufferEventHandler.png, 
> RingBufferEventHandler_exception.png, SyncFuture.png, 
> SyncFuture_exception.png, rs1081.jstack
>
>
> We have seen the regionserver hang while aborting several times. This takes 
> all regions on this regionserver out of service, and all affected 
> applications then stop working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
