[ https://issues.apache.org/jira/browse/KUDU-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993862#comment-16993862 ]
ASF subversion and git services commented on KUDU-3017:
-------------------------------------------------------
Commit 46b4d6f24fdacb79aa574a016bf5a6cb51d5e3b8 in kudu's branch
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=46b4d6f ]
[master] KUDU-3017 clean-up SysCatalogTable::SetupTablet()
If the master fails to replay orphaned operations from the WAL during
bootstrap, it crashes at the system tablet's state check
('orphaned operations' are REPLICATE messages present in the WAL
with no accompanying COMMIT):
F1206 01:32:15.488359 1324967 tablet_replica.cc:138] Check failed: state_
== SHUTDOWN || state_ == FAILED TabletReplica not fully shut down. State:
BOOTSTRAPPING
This patch addresses the issue so that the master does not crash at the
tablet state consistency CHECK() under such conditions. Instead, it now
reports the corresponding error and crashes at a higher level in
master_main.cc. With this patch, it's easier to attribute a failure
to its root cause by looking at the master's log.
I didn't add any tests: the replay of orphaned WAL transactions during
bootstrap already has pretty good coverage.
Change-Id: I6adfd7f74fdd2e05e04f6418cbf9bb86cad6465a
Reviewed-on: http://gerrit.cloudera.org:8080/14881
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <[email protected]>
> master crashes on attempt to replay orphaned ops in WAL, not reporting the
> root cause of the problem
> ----------------------------------------------------------------------------------------------------
>
> Key: KUDU-3017
> URL: https://issues.apache.org/jira/browse/KUDU-3017
> Project: Kudu
> Issue Type: Bug
> Components: master
> Affects Versions: 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.11.0,
> 1.11.1
> Reporter: Alexey Serbin
> Priority: Minor
> Attachments: core.stack.xz
>
>
> This bug is about misreporting the root cause of the problem, making it
> hard to correlate the error message with the actual problem and with the
> phase of the process lifecycle where it occurred. After analysis, it turned
> out to be just another manifestation/consequence of
> [KUDU-3016|https://issues.apache.org/jira/browse/KUDU-3016].
> I saw master crashing with the following error reported in the log:
> {noformat}
> F1206 01:32:15.488359 1324967 tablet_replica.cc:138] Check failed: state_ ==
> SHUTDOWN || state_ == FAILED TabletReplica not fully shut down. State:
> BOOTSTRAPPING
> {noformat}
> It's not easy to tell at what point of the master's lifecycle it happened,
> but after looking around in the log and into the generated core file it
> became clear that the problem was just a consequence of the conditions that
> triggered KUDU-3016 in the first place.
> Extra info from the log:
> {noformat}
> I1206 01:32:15.419330 1324967 tablet_bootstrap.cc:439] T
> 00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8:
> Bootstrap complete.
> I1206 01:32:15.471163 1324967 raft_consensus.cc:340] T
> 00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8 [term 164
> FOLLOWER]: Replica starting. Triggering 11 pending transactions. Active
> config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid:
> "77360e3dee9f4a748e75f830554326a8" member_type: VOTER last_known_addr { host:
> "master0" port: 7051 } } peers { permanent_uuid:
> "2a23cf2aee7549fbb63e6f8bcfb08cc3" member_type: VOTER last_known_addr { host:
> "master1" port: 7051 } } peers { permanent_uuid:
> "97326d428af84cf88d95eefe32eca0bd" member_type: VOTER last_known_addr { host:
> "master2" port: 7051 } }
> W1206 01:32:15.488217 1324967 transaction_tracker.cc:122] transaction on
> tablet 00000000000000000000000000000000 rejected due to memory pressure: the
> memory usage of this transaction (91215642) plus the current consumption (0)
> exceeds the transaction memory limit (67108864) or the limit of an ancestral
> memory tracker.
> {noformat}
> See the attached file for the stack trace captured in the core file.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)