[
https://issues.apache.org/jira/browse/HBASE-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-22289:
--------------------------
Attachment: HBASE-22289.branch-2.1.001.patch
> WAL-based log splitting resubmit threshold may result in a task being stuck
> forever
> -----------------------------------------------------------------------------------
>
> Key: HBASE-22289
> URL: https://issues.apache.org/jira/browse/HBASE-22289
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.1.0, 1.5.0
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Major
> Fix For: 2.1.5
>
> Attachments: HBASE-22289.01-branch-2.1.patch,
> HBASE-22289.02-branch-2.1.patch, HBASE-22289.03-branch-2.1.patch,
> HBASE-22289.branch-2.1.001.patch, HBASE-22289.branch-2.1.001.patch,
> HBASE-22289.branch-2.1.001.patch
>
>
> Not sure if this is handled better in procedure-based WAL splitting; in any
> case it affects the versions before that.
> The problem does not appear to be in ZK as such, but in the master's internal
> state tracking.
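> To make the status dumps below easier to read, here is a rough paraphrase of
> the per-task state the master keeps in memory; the field names follow the
> dumps and may not match the real SplitLogManager.Task class exactly:
> {code:java}
> import java.util.concurrent.atomic.AtomicInteger;
>
> // Rough paraphrase of the master's in-memory per-task split state
> // (cf. SplitLogManager.Task); a sketch, not the actual HBase source.
> class Task {
>   volatile long last_update;        // time of the last worker heartbeat
>   volatile int last_version;        // ZK znode version at the last update
>   volatile String cur_worker_name;  // worker the task is assigned to
>   volatile String status = "in_progress";
>   final AtomicInteger incarnation = new AtomicInteger(0);
>   final AtomicInteger unforcedResubmits = new AtomicInteger(0);
>   volatile boolean resubmitThresholdReached = false;
> }
> {code}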
> Master:
> {noformat}
> 2019-04-21 01:49:49,584 INFO
> [master/<master>:17000.splitLogManager..Chore.1]
> coordination.SplitLogManagerCoordination: Resubmitting task
> <path>.1555831286638
> {noformat}
> Worker RS, where the split fails:
> {noformat}
> ....
> 2019-04-21 02:05:31,774 INFO
> [RS_LOG_REPLAY_OPS-regionserver/<worker-rs>:17020-1] wal.WALSplitter:
> Processed 24 edits across 2 regions; edits skipped=457; log
> file=<path>.1555831286638, length=2156363702, corrupted=false, progress
> failed=true
> {noformat}
> Master (not sure what explains the delay of the acquired message; at any
> rate, it seems to detect the failure from this server fine):
> {noformat}
> 2019-04-21 02:11:14,928 INFO [main-EventThread]
> coordination.SplitLogManagerCoordination: Task <path>.1555831286638 acquired
> by <worker-rs>,17020,1555539815097
> 2019-04-21 02:19:41,264 INFO
> [master/<master>:17000.splitLogManager..Chore.1]
> coordination.SplitLogManagerCoordination: Skipping resubmissions of task
> <path>.1555831286638 because threshold 3 reached
> {noformat}
> After that, the task is stuck in limbo forever with the old worker and is
> never resubmitted.
> The RS never logs anything else for this task.
> Killing the RS on that worker unblocked the task, and some other server did
> the split very quickly; so it seems like the master doesn't clear the worker
> name in its internal state when hitting the threshold... The master was never
> restarted, so restarting the master might have also cleared it.
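> The resubmit chore's threshold check looks like the culprit. Roughly, using
> the Task sketch above (paraphrased from memory, not the actual branch-2.1
> source of ZKSplitLogManagerCoordination.resubmitTask()):
> {code:java}
> class ResubmitSketch {
>   // Suspected shape of the threshold check: once the threshold is hit,
>   // the method returns early without touching cur_worker_name or
>   // status, so the task stays in_progress, pinned to the old worker.
>   boolean resubmitTask(String path, Task task, int resubmitThreshold) {
>     if (!"in_progress".equals(task.status)) {
>       return false;
>     }
>     if (task.unforcedResubmits.get() >= resubmitThreshold) {
>       if (!task.resubmitThresholdReached) {
>         task.resubmitThresholdReached = true;
>         System.out.println("Skipping resubmissions of task " + path
>             + " because threshold " + resubmitThreshold + " reached");
>       }
>       return false; // stuck: nothing will ever reassign this task
>     }
>     // Otherwise the task is put back up for grabs (simplified).
>     task.unforcedResubmits.incrementAndGet();
>     task.incarnation.incrementAndGet();
>     task.cur_worker_name = null;
>     return true;
>   }
> }
> {code}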
> The following is extracted from SplitLogManager log messages; note the
> timestamps: the task state is unchanged more than a day later.
> {noformat}
> 2019-04-21 02:2 1555831286638=last_update = 1555837874928 last_version = 11
> cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress
> incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20,
> ....
> 2019-04-22 11:1 1555831286638=last_update = 1555837874928 last_version = 11
> cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress
> incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20}
> {noformat}
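> If the diagnosis is right, a minimal mitigation could replace the early
> return in the sketch above with something along these lines (a sketch only;
> I have not verified that the attached patch does exactly this):
> {code:java}
> // Hypothetical: instead of returning early at the threshold, drop the
> // stale assignment so the chore can hand the task to another worker.
> if (task.unforcedResubmits.get() >= resubmitThreshold) {
>   task.resubmitThresholdReached = true;
>   task.cur_worker_name = null;      // forget the stuck worker
>   task.incarnation.incrementAndGet();
>   // ... then re-create the ZK task node so another RS can pick it up
>   return true;
> }
> {code}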
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)