Alexey Serbin created KUDU-3017:
-----------------------------------

             Summary: master crashes on attempt to replay orphaned ops in WAL, 
not reporting the root cause of the problem
                 Key: KUDU-3017
                 URL: https://issues.apache.org/jira/browse/KUDU-3017
             Project: Kudu
          Issue Type: Bug
          Components: master
    Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0
            Reporter: Alexey Serbin
         Attachments: core.stack.xz

This bug is about misreporting the root cause of a crash: the error message is 
hard to correlate with the actual problem and with the phase of the process 
lifecycle in which it occurred. After analysis, it turned out to be just another 
manifestation/consequence of 
[KUDU-3016|https://issues.apache.org/jira/browse/KUDU-3016].

I saw the master crash with the following error reported in the log:

{noformat}
F1206 01:32:15.488359 1324967 tablet_replica.cc:138] Check failed: state_ == 
SHUTDOWN || state_ == FAILED TabletReplica not fully shut down. State: 
BOOTSTRAPPING
{noformat}

It's not easy to tell at what point in the master's lifecycle this happened, but 
after looking through the log and the generated core file it became clear that 
the problem was just a consequence of the conditions that triggered KUDU-3016 
in the first place.

Extra info from the log:
{noformat}
I1206 01:32:15.419330 1324967 tablet_bootstrap.cc:439] T 
00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8: Bootstrap 
complete.
I1206 01:32:15.471163 1324967 raft_consensus.cc:340] T 
00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8 [term 164 
FOLLOWER]: Replica starting. Triggering 11 pending transactions. Active config: 
opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: 
"77360e3dee9f4a748e75f830554326a8" member_type: VOTER last_known_addr { host: 
"nrappmst3.nrap.lguplus.co.kr" port: 7051 } } peers { permanent_uuid: 
"2a23cf2aee7549fbb63e6f8bcfb08cc3" member_type: VOTER last_known_addr { host: 
"nrappmst4.nrap.lguplus.co.kr" port: 7051 } } peers { permanent_uuid: 
"97326d428af84cf88d95eefe32eca0bd" member_type: VOTER last_known_addr { host: 
"nrappmst5.nrap.lguplus.co.kr" port: 7051 } }
W1206 01:32:15.488217 1324967 transaction_tracker.cc:122] transaction on tablet 
00000000000000000000000000000000 rejected due to memory pressure: the memory 
usage of this transaction (91215642) plus the current consumption (0) exceeds 
the transaction memory limit (67108864) or the limit of an ancestral memory 
tracker.
{noformat}
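
To make the numbers in the last warning concrete, here is a minimal sketch of the arithmetic behind the rejection. It is an illustration only, not Kudu's code: the function and constant names are hypothetical, the exact comparison Kudu uses is an assumption, and the "ancestral memory tracker" part of the check is not modeled. The values are copied from the log message.

```python
# Illustrative model of the admission check described by the warning above.
# Not Kudu's actual implementation; all names here are hypothetical.

def fits_in_limit(op_bytes: int, current_bytes: int, limit_bytes: int) -> bool:
    """Return True if admitting the operation keeps usage within the limit."""
    return op_bytes + current_bytes <= limit_bytes

# Values copied from the log message.
OP_BYTES = 91215642       # memory usage of the rejected transaction
CURRENT_BYTES = 0         # current consumption reported by the tracker
LIMIT_BYTES = 67108864    # transaction memory limit (64 MiB)

admitted = fits_in_limit(OP_BYTES, CURRENT_BYTES, LIMIT_BYTES)
excess = OP_BYTES + CURRENT_BYTES - LIMIT_BYTES

print(f"admitted={admitted}, over the limit by {excess} bytes")
```

With the current consumption at 0, the single pending operation alone is larger than the 64 MiB limit, so it would be rejected regardless of what else is in flight.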

See the attached file for the stack trace captured in the core file.


