[
https://issues.apache.org/jira/browse/HBASE-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844356#comment-16844356
]
Hudson commented on HBASE-22289:
--------------------------------
Results for branch branch-2.1
[build #1164 on
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1164/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(x) {color:red}-1 general checks{color}
-- Something went wrong running this stage, please [check relevant console
output|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1164//console].
(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2)
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1164//JDK8_Nightly_Build_Report_(Hadoop2)/]
(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1164//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> WAL-based log splitting resubmit threshold may result in a task being stuck
> forever
> -----------------------------------------------------------------------------------
>
> Key: HBASE-22289
> URL: https://issues.apache.org/jira/browse/HBASE-22289
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.1.0, 1.5.0
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Major
> Fix For: 2.0.6, 2.1.5
>
> Attachments: HBASE-22289.01-branch-2.1.patch,
> HBASE-22289.02-branch-2.1.patch, HBASE-22289.03-branch-2.1.patch,
> HBASE-22289.branch-2.1.001.patch, HBASE-22289.branch-2.1.001.patch,
> HBASE-22289.branch-2.1.001.patch
>
>
> Not sure if this is handled better in procedure-based WAL splitting; in any
> case it affects versions before that.
> The problem does not seem to be in ZK as such, but in the master's internal
> state tracking.
> Master:
> {noformat}
> 2019-04-21 01:49:49,584 INFO
> [master/<master>:17000.splitLogManager..Chore.1]
> coordination.SplitLogManagerCoordination: Resubmitting task
> <path>.1555831286638
> {noformat}
> Worker RS (the split fails):
> {noformat}
> ....
> 2019-04-21 02:05:31,774 INFO
> [RS_LOG_REPLAY_OPS-regionserver/<worker-rs>:17020-1] wal.WALSplitter:
> Processed 24 edits across 2 regions; edits skipped=457; log
> file=<path>.1555831286638, length=2156363702, corrupted=false, progress
> failed=true
> {noformat}
> Master (not sure about the delay of the acquired-message; at any rate it
> seems to detect the failure fine from this server)
> {noformat}
> 2019-04-21 02:11:14,928 INFO [main-EventThread]
> coordination.SplitLogManagerCoordination: Task <path>.1555831286638 acquired
> by <worker-rs>,17020,1555539815097
> 2019-04-21 02:19:41,264 INFO
> [master/<master>:17000.splitLogManager..Chore.1]
> coordination.SplitLogManagerCoordination: Skipping resubmissions of task
> <path>.1555831286638 because threshold 3 reached
> {noformat}
> After that, the task is stuck in limbo forever with the old worker and is
> never resubmitted.
> The RS never logs anything else for this task.
> Killing the RS on the worker unblocked the task, and another server then did
> the split very quickly, so it seems the master doesn't clear the worker name
> in its internal state when hitting the threshold. The master was never
> restarted, so restarting it might also have cleared the state.
> The following is extracted from SplitLogManager log messages; note the times.
> {noformat}
> 2019-04-21 02:2 1555831286638=last_update = 1555837874928 last_version = 11
> cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress
> incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20,
> ....
> 2019-04-22 11:1 1555831286638=last_update = 1555837874928 last_version = 11
> cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress
> incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20}
> {noformat}
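
The stuck state described in the quoted report can be illustrated with a small model. This is a hypothetical simplification, not HBase's actual SplitLogManager code: the class `SplitTaskSketch`, the method `maybeResubmit`, and the rule that a dead worker bypasses the threshold are assumptions, inferred only from the log lines above and from the observation that killing the RS unblocked the task.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical, simplified model of master-side task state; not HBase's real classes. */
final class SplitTaskSketch {
    enum Status { IN_PROGRESS, UNASSIGNED }

    // Matches "threshold 3 reached" in the master log quoted above.
    static final int RESUBMIT_THRESHOLD = 3;

    Status status = Status.IN_PROGRESS;
    String curWorkerName;
    int resubmits;
    long lastUpdateMs;

    SplitTaskSketch(String curWorkerName, int resubmits, long lastUpdateMs) {
        this.curWorkerName = curWorkerName;
        this.resubmits = resubmits;
        this.lastUpdateMs = lastUpdateMs;
    }

    /**
     * One pass of a periodic timeout check.
     * Returns true if the task was handed back for reassignment.
     */
    boolean maybeResubmit(long nowMs, long timeoutMs, Set<String> liveWorkers) {
        if (status != Status.IN_PROGRESS) {
            return false;
        }
        boolean workerAlive = liveWorkers.contains(curWorkerName);
        if (workerAlive) {
            if (nowMs - lastUpdateMs < timeoutMs) {
                return false; // worker still within its timeout
            }
            if (resubmits >= RESUBMIT_THRESHOLD) {
                // Resubmission is skipped and nothing else changes: the task
                // stays IN_PROGRESS and keeps the old worker name, so every
                // later pass takes this same branch.
                return false;
            }
            resubmits++;
        }
        // Assumption: a dead worker forces reassignment regardless of the threshold.
        status = Status.UNASSIGNED;
        curWorkerName = null;
        lastUpdateMs = nowMs;
        return true;
    }

    public static void main(String[] args) {
        SplitTaskSketch task = new SplitTaskSketch("worker-rs", 3, 0L);
        Set<String> live = Collections.singleton("worker-rs");
        // Worker is alive but silent, and the threshold is already reached.
        System.out.println(task.maybeResubmit(10_000L, 1_000L, live));
        // Worker is gone from the live set, as after killing the RS.
        System.out.println(task.maybeResubmit(10_000L, 1_000L, Collections.emptySet()));
    }
}
```

In this model the first call leaves the task untouched no matter how much time passes, which mirrors the unchanged `last_update` and `cur_worker_name` values in the two state dumps. Only removing the worker from the live set releases the task.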