[
https://issues.apache.org/jira/browse/KUDU-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Serbin updated KUDU-3017:
--------------------------------
Status: In Review (was: Open)
> master crashes on attempt to replay orphaned ops in WAL, not reporting the
> root cause of the problem
> ---------------------------------------------------------------------------------------------------
>
> Key: KUDU-3017
> URL: https://issues.apache.org/jira/browse/KUDU-3017
> Project: Kudu
> Issue Type: Bug
> Components: master
> Affects Versions: 1.10.0, 1.10.1, 1.11.0, 1.11.1
> Reporter: Alexey Serbin
> Priority: Minor
> Attachments: core.stack.xz
>
>
> This bug is about misreporting the root cause of a crash: it's not easy to
> correlate the error message with the actual problem or with the phase of the
> process lifecycle in which it occurred. After analysis, it turned out to be
> just another manifestation/consequence of
> [KUDU-3016|https://issues.apache.org/jira/browse/KUDU-3016].
> I saw the master crash with the following error reported in the log:
> {noformat}
> F1206 01:32:15.488359 1324967 tablet_replica.cc:138] Check failed: state_ ==
> SHUTDOWN || state_ == FAILED TabletReplica not fully shut down. State:
> BOOTSTRAPPING
> {noformat}
> It's not easy to tell at what point of the master's lifecycle this happened,
> but after looking through the log and the generated core file it became
> clear that the problem was just a consequence of the conditions that
> triggered KUDU-3016 in the first place.
> Extra info from the log:
> {noformat}
> I1206 01:32:15.419330 1324967 tablet_bootstrap.cc:439] T
> 00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8:
> Bootstrap complete.
> I1206 01:32:15.471163 1324967 raft_consensus.cc:340] T
> 00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8 [term 164
> FOLLOWER]: Replica starting. Triggering 11 pending transactions. Active
> config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid:
> "77360e3dee9f4a748e75f830554326a8" member_type: VOTER last_known_addr { host:
> "nrappmst3.nrap.lguplus.co.kr" port: 7051 } } peers { permanent_uuid:
> "2a23cf2aee7549fbb63e6f8bcfb08cc3" member_type: VOTER last_known_addr { host:
> "nrappmst4.nrap.lguplus.co.kr" port: 7051 } } peers { permanent_uuid:
> "97326d428af84cf88d95eefe32eca0bd" member_type: VOTER last_known_addr { host:
> "nrappmst5.nrap.lguplus.co.kr" port: 7051 } }
> W1206 01:32:15.488217 1324967 transaction_tracker.cc:122] transaction on
> tablet 00000000000000000000000000000000 rejected due to memory pressure: the
> memory usage of this transaction (91215642) plus the current consumption (0)
> exceeds the transaction memory limit (67108864) or the limit of an ancestral
> memory tracker.
> {noformat}
> See the attached file for the stack trace captured in the core file.
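For reference, the numbers in the transaction_tracker.cc warning quoted above can be checked with a minimal sketch. The three values are copied from the log; the helper name `would_be_rejected` and the constant names are hypothetical and not part of Kudu. The sketch covers only the first condition named in the warning (usage plus current consumption exceeding the transaction memory limit) and ignores the "ancestral memory tracker" condition.

```python
# Values copied verbatim from the warning in the quoted log.
TXN_FOOTPRINT_BYTES = 91215642      # memory usage of the rejected transaction
CURRENT_CONSUMPTION_BYTES = 0       # current consumption reported in the warning
TXN_MEMORY_LIMIT_BYTES = 67108864   # transaction memory limit reported in the warning

MIB = 1024 * 1024


def would_be_rejected(footprint: int, current: int, limit: int) -> bool:
    """Hypothetical helper: mirrors only the comparison described in the
    warning text (footprint + current consumption > limit). It is not
    Kudu's actual implementation."""
    return footprint + current > limit


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to MiB for easier reading."""
    return num_bytes / MIB


# How far the single transaction is over the limit, in bytes.
excess_bytes = (
    TXN_FOOTPRINT_BYTES + CURRENT_CONSUMPTION_BYTES - TXN_MEMORY_LIMIT_BYTES
)

if __name__ == "__main__":
    print(f"limit:     {to_mib(TXN_MEMORY_LIMIT_BYTES):.2f} MiB")
    print(f"footprint: {to_mib(TXN_FOOTPRINT_BYTES):.2f} MiB")
    print(f"excess:    {excess_bytes} bytes")
    print("rejected:", would_be_rejected(
        TXN_FOOTPRINT_BYTES, CURRENT_CONSUMPTION_BYTES, TXN_MEMORY_LIMIT_BYTES))
```

The reported limit is exactly 64 MiB, and the transaction alone is larger than that even with zero current consumption, so under these numbers it is rejected regardless of what else is in flight.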