[ 
https://issues.apache.org/jira/browse/HBASE-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844493#comment-16844493
 ] 

Hudson commented on HBASE-22289:
--------------------------------

Results for branch branch-2.0
        [build #1605 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1605/]: 
(x) *{color:red}-1 overall{color}*
----
details (if available):

(x) {color:red}-1 general checks{color}
-- Something went wrong running this stage, please [check relevant console 
output|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1605//console].




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- Something went wrong running this stage, please [check relevant console 
output|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1605//console].


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1605//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> WAL-based log splitting resubmit threshold may result in a task being stuck 
> forever
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-22289
>                 URL: https://issues.apache.org/jira/browse/HBASE-22289
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.1.0, 1.5.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>             Fix For: 2.0.6, 2.1.5
>
>         Attachments: HBASE-22289.01-branch-2.1.patch, 
> HBASE-22289.02-branch-2.1.patch, HBASE-22289.03-branch-2.1.patch, 
> HBASE-22289.branch-2.1.001.patch, HBASE-22289.branch-2.1.001.patch, 
> HBASE-22289.branch-2.1.001.patch
>
>
> It is unclear whether procedure-based WAL splitting handles this better; in any 
> case, the bug affects versions that predate it.
> The problem appears to lie not in ZK as such but in the master's internal state 
> tracking.
> Master:
> {noformat}
> 2019-04-21 01:49:49,584 INFO  
> [master/<master>:17000.splitLogManager..Chore.1] 
> coordination.SplitLogManagerCoordination: Resubmitting task 
> <path>.1555831286638
> {noformat}
> Worker RS; the split fails:
> {noformat}
> ....
> 2019-04-21 02:05:31,774 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/<worker-rs>:17020-1] wal.WALSplitter: 
> Processed 24 edits across 2 regions; edits skipped=457; log 
> file=<path>.1555831286638, length=2156363702, corrupted=false, progress 
> failed=true
> {noformat}
> Master (the reason for the delay before the acquired message is unclear; at any 
> rate, it seems to detect the failure from this server fine):
> {noformat}
> 2019-04-21 02:11:14,928 INFO  [main-EventThread] 
> coordination.SplitLogManagerCoordination: Task <path>.1555831286638 acquired 
> by <worker-rs>,17020,1555539815097
> 2019-04-21 02:19:41,264 INFO  
> [master/<master>:17000.splitLogManager..Chore.1] 
> coordination.SplitLogManagerCoordination: Skipping resubmissions of task 
> <path>.1555831286638 because threshold 3 reached
> {noformat}
> After that, the task is stuck in limbo forever with the old worker and is 
> never resubmitted. 
> The RS never logs anything else for this task.
> Killing the RS on the worker unblocked the task, and another server completed the 
> split very quickly, so it seems the master doesn't clear the worker name in its 
> internal state when it hits the threshold. The master was never restarted, so 
> restarting it might also have cleared the state.
> The following is extracted from SplitLogManager log messages; note the times.
> {noformat}
> 2019-04-21 02:2   1555831286638=last_update = 1555837874928 last_version = 11 
> cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress 
> incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20, 
> ....
> 2019-04-22 11:1   1555831286638=last_update = 1555837874928 last_version = 11 
> cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress 
> incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20}
> {noformat}
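
The behavior described above can be sketched as a small model. This is a hypothetical illustration, not HBase's actual code: the `Task` class and `try_resubmit` function are invented for the example, the field names are borrowed from the state dump, and the threshold value 3 comes from the "threshold 3 reached" log line. The assumption that a dead worker bypasses the threshold is inferred from the report that killing the RS unblocked the task.

```python
from dataclasses import dataclass
from typing import Optional

RESUBMIT_THRESHOLD = 3  # value from "threshold 3 reached" in the master log


@dataclass
class Task:
    # Field names mirror the state dump; the class itself is hypothetical.
    cur_worker_name: Optional[str]
    status: str = "in_progress"
    resubmits: int = 0


def try_resubmit(task: Task, worker_dead: bool) -> bool:
    """Return True if the task was released for another worker to pick up."""
    # Assumption: a dead worker forces the resubmit and bypasses the threshold,
    # which would explain why killing the RS unblocked the task.
    if not worker_dead:
        if task.resubmits >= RESUBMIT_THRESHOLD:
            # Threshold reached: skip, leaving cur_worker_name and status as-is.
            return False
        task.resubmits += 1
    task.cur_worker_name = None
    task.status = "unassigned"
    return True


# A task in the state shown in the dump: three resubmits, worker still alive.
stuck = Task(cur_worker_name="<worker-rs>,17020,1555539815097", resubmits=3)
skipped = [try_resubmit(stuck, worker_dead=False) for _ in range(5)]
state_while_alive = (stuck.cur_worker_name, stuck.status, stuck.resubmits)

# Only the death of the worker releases the task in this model.
released = try_resubmit(stuck, worker_dead=True)
```

In this model, once the threshold is reached every periodic check returns without touching the task, so a failed split on a live worker leaves `cur_worker_name` and `status` frozen, which is consistent with the two identical dumps taken a day apart.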



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
