[
https://issues.apache.org/jira/browse/HBASE-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-22289:
--------------------------
Attachment: HBASE-22289.branch-2.1.001.patch
> WAL-based log splitting resubmit threshold may result in a task being stuck
> forever
> -----------------------------------------------------------------------------------
>
> Key: HBASE-22289
> URL: https://issues.apache.org/jira/browse/HBASE-22289
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.1.0, 1.5.0
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Major
> Fix For: 2.1.5
>
> Attachments: HBASE-22289.01-branch-2.1.patch,
> HBASE-22289.02-branch-2.1.patch, HBASE-22289.03-branch-2.1.patch,
> HBASE-22289.branch-2.1.001.patch, HBASE-22289.branch-2.1.001.patch,
> HBASE-22289.branch-2.1.001.patch
>
>
> Not sure if this is handled better in procedure-based WAL splitting; in any
> case it affects the versions before that.
> The problem does not appear to be in ZK as such, but in the master's internal
> state tracking.
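> To make the status dumps below easier to read, here is a rough paraphrase of
> the per-task state the master keeps in memory; the field names follow the
> dumps and may not match the real SplitLogManager.Task class exactly:
> {code:java}
> import java.util.concurrent.atomic.AtomicInteger;
>
> // Rough paraphrase of the master's in-memory per-task split state
> // (cf. SplitLogManager.Task); a sketch, not the actual HBase source.
> class Task {
>   volatile long last_update;        // time of the last worker heartbeat
>   volatile int last_version;        // ZK znode version at the last update
>   volatile String cur_worker_name;  // worker the task is assigned to
>   volatile String status = "in_progress";
>   final AtomicInteger incarnation = new AtomicInteger(0);
>   final AtomicInteger unforcedResubmits = new AtomicInteger(0);
>   volatile boolean resubmitThresholdReached = false;
> }
> {code}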
> Master:
> {noformat}
> 2019-04-21 01:49:49,584 INFO
> [master/<master>:17000.splitLogManager..Chore.1]
> coordination.SplitLogManagerCoordination: Resubmitting task
> <path>.1555831286638
> {noformat}
> Worker RS, where the split fails:
> {noformat}
> ....
> 2019-04-21 02:05:31,774 INFO
> [RS_LOG_REPLAY_OPS-regionserver/<worker-rs>:17020-1] wal.WALSplitter:
> Processed 24 edits across 2 regions; edits skipped=457; log
> file=<path>.1555831286638, length=2156363702, corrupted=false, progress
> failed=true
> {noformat}
> Master (not sure what explains the delay of the acquired message; at any
> rate, it seems to detect the failure from this server fine):
> {noformat}
> 2019-04-21 02:11:14,928 INFO [main-EventThread]
> coordination.SplitLogManagerCoordination: Task <path>.1555831286638 acquired
> by <worker-rs>,17020,1555539815097
> 2019-04-21 02:19:41,264 INFO
> [master/<master>:17000.splitLogManager..Chore.1]
> coordination.SplitLogManagerCoordination: Skipping resubmissions of task
> <path>.1555831286638 because threshold 3 reached
> {noformat}
> After that, the task is stuck in limbo forever with the old worker and is
> never resubmitted.
> The RS never logs anything else for this task.
> Killing the RS on that worker unblocked the task, and some other server did
> the split very quickly; so it seems like the master doesn't clear the worker
> name in its internal state when hitting the threshold... The master was never
> restarted, so restarting the master might have also cleared it.
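> The resubmit chore's threshold check looks like the culprit. Roughly, using
> the Task sketch above (paraphrased from memory, not the actual branch-2.1
> source of ZKSplitLogManagerCoordination.resubmitTask()):
> {code:java}
> class ResubmitSketch {
>   // Suspected shape of the threshold check: once the threshold is hit,
>   // the method returns early without touching cur_worker_name or
>   // status, so the task stays in_progress, pinned to the old worker.
>   boolean resubmitTask(String path, Task task, int resubmitThreshold) {
>     if (!"in_progress".equals(task.status)) {
>       return false;
>     }
>     if (task.unforcedResubmits.get() >= resubmitThreshold) {
>       if (!task.resubmitThresholdReached) {
>         task.resubmitThresholdReached = true;
>         System.out.println("Skipping resubmissions of task " + path
>             + " because threshold " + resubmitThreshold + " reached");
>       }
>       return false; // stuck: nothing will ever reassign this task
>     }
>     // Otherwise the task is put back up for grabs (simplified).
>     task.unforcedResubmits.incrementAndGet();
>     task.incarnation.incrementAndGet();
>     task.cur_worker_name = null;
>     return true;
>   }
> }
> {code}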
> The following is extracted from SplitLogManager log messages; note the
> timestamps: the task state is unchanged more than a day later.
> {noformat}
> 2019-04-21 02:2 1555831286638=last_update = 1555837874928 last_version = 11
> cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress
> incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20,
> ....
> 2019-04-22 11:1 1555831286638=last_update = 1555837874928 last_version = 11
> cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress
> incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20}
> {noformat}
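> If the diagnosis is right, a minimal mitigation could replace the early
> return in the sketch above with something along these lines (a sketch only;
> I have not verified that the attached patch does exactly this):
> {code:java}
> // Hypothetical: instead of returning early at the threshold, drop the
> // stale assignment so the chore can hand the task to another worker.
> if (task.unforcedResubmits.get() >= resubmitThreshold) {
>   task.resubmitThresholdReached = true;
>   task.cur_worker_name = null;      // forget the stuck worker
>   task.incarnation.incrementAndGet();
>   // ... then re-create the ZK task node so another RS can pick it up
>   return true;
> }
> {code}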
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)