[ 
https://issues.apache.org/jira/browse/KUDU-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3017:
--------------------------------
    Affects Version/s: 1.7.0
                       1.8.0
                       1.7.1
                       1.9.0

> master crashes on attempt to replay orphaned ops in WAL, not reporting the 
> root cause of the problem
> ---------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-3017
>                 URL: https://issues.apache.org/jira/browse/KUDU-3017
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 
> 1.11.1
>            Reporter: Alexey Serbin
>            Priority: Minor
>         Attachments: core.stack.xz
>
>
> This bug is about misreporting the root cause of a crash: it's not easy to 
> correlate the error message with the actual problem or with the phase of 
> the process lifecycle in which it occurred. After analysis, it turned out 
> to be just another manifestation/consequence of 
> [KUDU-3016|https://issues.apache.org/jira/browse/KUDU-3016].
> I saw master crashing with the following error reported in the log:
> {noformat}
> F1206 01:32:15.488359 1324967 tablet_replica.cc:138] Check failed: state_ == 
> SHUTDOWN || state_ == FAILED TabletReplica not fully shut down. State: 
> BOOTSTRAPPING
> {noformat}
> It's not easy to tell at what point of the master's lifecycle it happened, 
> but after looking around in the log and into the generated core file it 
> became clear the problem was just a consequence of the conditions that 
> triggered KUDU-3016 in the first place.
> Extra info from the log:
> {noformat}
> I1206 01:32:15.419330 1324967 tablet_bootstrap.cc:439] T 
> 00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8: 
> Bootstrap complete.
> I1206 01:32:15.471163 1324967 raft_consensus.cc:340] T 
> 00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8 [term 164 
> FOLLOWER]: Replica starting. Triggering 11 pending transactions. Active 
> config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: 
> "77360e3dee9f4a748e75f830554326a8" member_type: VOTER last_known_addr { host: 
> "master0" port: 7051 } } peers { permanent_uuid: 
> "2a23cf2aee7549fbb63e6f8bcfb08cc3" member_type: VOTER last_known_addr { host: 
> "master1" port: 7051 } } peers { permanent_uuid: 
> "97326d428af84cf88d95eefe32eca0bd" member_type: VOTER last_known_addr { host: 
> "master2" port: 7051 } }
> W1206 01:32:15.488217 1324967 transaction_tracker.cc:122] transaction on 
> tablet 00000000000000000000000000000000 rejected due to memory pressure: the 
> memory usage of this transaction (91215642) plus the current consumption (0) 
> exceeds the transaction memory limit (67108864) or the limit of an ancestral 
> memory tracker.
> {noformat}
> See the attached file for the stack trace captured in the core file.
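
The numbers in the warning above can be checked with a few lines of arithmetic. The sketch below is not Kudu code: the helper name `would_exceed_limit` is hypothetical, and it only mirrors the comparison described in the log text, ignoring the "ancestral memory tracker" clause. Under those assumptions, it shows that the single 91215642-byte transaction is by itself larger than the 67108864-byte (64 MiB) limit, so it would be rejected even with zero current consumption.

```python
# Hypothetical helper mirroring the comparison described in the warning text;
# this is NOT the actual Kudu implementation.
def would_exceed_limit(txn_bytes: int, current_bytes: int, limit_bytes: int) -> bool:
    return txn_bytes + current_bytes > limit_bytes

TXN_BYTES = 91215642      # "memory usage of this transaction" from the log
CURRENT_BYTES = 0         # "current consumption" from the log
LIMIT_BYTES = 67108864    # "transaction memory limit" from the log (64 * 1024 * 1024)

rejected = would_exceed_limit(TXN_BYTES, CURRENT_BYTES, LIMIT_BYTES)
# How many bytes over the limit the transaction would be.
overshoot = TXN_BYTES + CURRENT_BYTES - LIMIT_BYTES
print(rejected, overshoot)
```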



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
