Sergey Shelukhin created HBASE-22289:
----------------------------------------

             Summary: WAL-based log splitting resubmit threshold results in a 
task being stuck forever
                 Key: HBASE-22289
                 URL: https://issues.apache.org/jira/browse/HBASE-22289
             Project: HBase
          Issue Type: Bug
            Reporter: Sergey Shelukhin


Not sure if this is handled better in procedure-based WAL splitting; in any 
case it affects versions before that.
The problem seems to be not in ZK as such but in the master's internal state 
tracking.

Master:
{noformat}
2019-04-21 01:49:49,584 INFO  [master/<master>:17000.splitLogManager..Chore.1] 
coordination.SplitLogManagerCoordination: Resubmitting task <path>.1555831286638
{noformat}

Worker RS (<worker-rs>), where the split fails:
{noformat}
....
2019-04-21 02:05:31,774 INFO  
[RS_LOG_REPLAY_OPS-regionserver/<worker-rs>:17020-1] wal.WALSplitter: Processed 
24 edits across 2 regions; edits skipped=457; log file=<path>.1555831286638, 
length=2156363702, corrupted=false, progress failed=true
{noformat}


Master (not sure why the acquired message is delayed; at any rate it seems 
to detect the failure from this server fine):
{noformat}
2019-04-21 02:11:14,928 INFO  [main-EventThread] 
coordination.SplitLogManagerCoordination: Task <path>.1555831286638 acquired by 
<worker-rs>,17020,1555539815097
2019-04-21 02:19:41,264 INFO  [master/<master>:17000.splitLogManager..Chore.1] 
coordination.SplitLogManagerCoordination: Skipping resubmissions of task 
<path>.1555831286638 because threshold 3 reached
{noformat}

After that the task is stuck in limbo forever with the old worker and is 
never resubmitted. 
The RS never logs anything else for this task.
Killing the RS on the worker unblocked the task, and another server then did 
the split very quickly. So it seems the master doesn't clear the worker name in 
its internal state when it hits the threshold. The master was never restarted, 
so restarting it might also have cleared the state.
The following is extracted from SplitLogManager log messages; note the times.
{noformat}
2019-04-21 02:2   1555831286638=last_update = 1555837874928 last_version = 11 
cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress 
incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20, 
....
2019-04-22 11:1   1555831286638=last_update = 1555837874928 last_version = 11 
cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress 
incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20}
{noformat}
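
For reference, the last_update value in both dumps looks like an epoch-millisecond timestamp. Below is a minimal sketch for decoding it. The class and method names are made up for illustration, and the UTC-7 offset is an assumption: the logs do not state their time zone.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Illustrative helper only; not part of HBase. */
class LastUpdateCheck {
    // last_update value copied from the SplitLogManager dump above.
    static final long LAST_UPDATE_MS = 1555837874928L;

    /** Formats epoch milliseconds at a fixed offset, using the same layout as the log timestamps. */
    static String format(long epochMs, ZoneOffset offset) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");
        return Instant.ofEpochMilli(epochMs).atOffset(offset).format(f);
    }

    public static void main(String[] args) {
        // Prints 2019-04-21 09:11:14,928
        System.out.println(format(LAST_UPDATE_MS, ZoneOffset.UTC));
        // ASSUMPTION: the servers log in a UTC-7 local zone.
        // Prints 2019-04-21 02:11:14,928
        System.out.println(format(LAST_UPDATE_MS, ZoneOffset.ofHours(-7)));
    }
}
```

Under that assumption the value matches the 02:11:14,928 "acquired by" message, and it is identical in both dumps, which is consistent with the master's view of the task not changing after that point.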


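The suspected behavior can be sketched as a toy model. This is not HBase code: the class, fields, and method below are hypothetical, and the rule that a dead worker forces a resubmit is an assumption made only because killing the RS unblocked the task.

```java
/** Toy model of the suspected behavior. NOT HBase code; every name here is hypothetical. */
class StuckTaskModel {
    // Value taken from the "threshold 3 reached" log line.
    static final int RESUBMIT_THRESHOLD = 3;

    /** Minimal stand-in for the master's per-task state (cf. cur_worker_name, resubmits). */
    static class Task {
        String curWorker;
        int resubmits;

        Task(String curWorker, int resubmits) {
            this.curWorker = curWorker;
            this.resubmits = resubmits;
        }
    }

    /**
     * One pass of a hypothetical timeout chore over a task whose worker has gone silent.
     * Returns true if the task was released for another worker to pick up.
     */
    static boolean choreTick(Task task, boolean workerIsDead) {
        if (workerIsDead) {
            // ASSUMPTION: a dead worker forces a resubmit regardless of the threshold.
            task.curWorker = null;
            return true;
        }
        if (task.resubmits >= RESUBMIT_THRESHOLD) {
            // Suspected problem: the chore gives up but leaves curWorker set,
            // so the task stays in progress under a worker that has stopped working on it.
            return false;
        }
        task.resubmits++;
        task.curWorker = null;
        return true;
    }

    public static void main(String[] args) {
        Task task = new Task("worker-rs", 3); // resubmits = 3, as in the dump
        for (int i = 0; i < 1000; i++) {
            choreTick(task, false); // worker is alive but silent
        }
        System.out.println("after 1000 ticks: worker=" + task.curWorker);
        choreTick(task, true); // the RS is killed
        System.out.println("after RS death: worker=" + task.curWorker);
    }
}
```

In this model the task keeps its worker through any number of chore passes once the counter equals the threshold, and is released only when the worker is reported dead. If the real code behaves similarly, a fix would need to either clear the worker or keep timing out the task after the threshold is reached.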

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
