[
https://issues.apache.org/jira/browse/KUDU-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392473#comment-17392473
]
Alexey Serbin edited comment on KUDU-2635 at 8/3/21, 7:15 PM:
--------------------------------------------------------------
[~kongfuboy], if the concern is about possible issues introduced in newer Kudu
versions, a path forward might be picking up the
[patch|https://github.com/apache/kudu/commit/bec75e5ac03eec74da8c091a99a4b9f9a27e2b2b]
and back-porting it to the source of the branch used to build the binaries
that your cluster is running (I guess it's 1.10?). Then rebuild the
{{kudu-tserver}} binary and replace it on the Kudu tablet server nodes, making
sure to restart the {{kudu-tserver}} processes after the binary is replaced.
As you can see, the essence of the
[fix|https://github.com/apache/kudu/commit/bec75e5ac03eec74da8c091a99a4b9f9a27e2b2b]
(modulo an extra test added) is just a single line in
{{src/kudu/tablet/tablet_metadata.cc}}.
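The back-port steps described above could be sketched roughly as follows. This is only an illustration: the branch name {{branch-1.10.x}}, the remote name {{origin}}, the local branch name, and the helper {{backport_plan}} are assumptions rather than anything stated in this thread. The script only prints the git commands and runs nothing; rebuilding {{kudu-tserver}} afterwards should follow the build instructions of the branch in question.

```python
# Commit referenced in the comment above.
FIX_COMMIT = "bec75e5ac03eec74da8c091a99a4b9f9a27e2b2b"


def backport_plan(release_branch, commit=FIX_COMMIT, remote="origin"):
    """Return the git commands (as argv lists) for a back-port.

    Hypothetical helper: it only builds a plan and executes nothing.
    """
    local_branch = "backport-KUDU-2635"  # assumed local branch name
    return [
        # Make sure the release branch and the fix commit are available locally.
        ["git", "fetch", remote],
        # Start a working branch from the release branch the cluster was built from.
        ["git", "checkout", "-b", local_branch, f"{remote}/{release_branch}"],
        # Apply the fix; -x records the original commit hash in the new commit message.
        ["git", "cherry-pick", "-x", commit],
    ]


if __name__ == "__main__":
    # "branch-1.10.x" is an assumption about the branch name; adjust as needed.
    for argv in backport_plan("branch-1.10.x"):
        print(" ".join(argv))
```

Printing the plan first makes it easy to review the exact commands before running them by hand against the source tree.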
> Tserver crash because some orphaned blocks are still listed when deleting
> metadata
> ----------------------------------------------------------------------------------
>
> Key: KUDU-2635
> URL: https://issues.apache.org/jira/browse/KUDU-2635
> Project: Kudu
> Issue Type: Bug
> Components: fs, tablet, tserver
> Affects Versions: 1.7.0
> Reporter: Andrew Wong
> Assignee: Andrew Wong
> Priority: Major
> Fix For: 1.11.0
>
>
> In some cases, upon deleting a tablet, a tablet server may fail to delete
> some blocks, and then fail to delete the tablet metadata, leading to a crash
> since failure to delete metadata is a fatal error. That's what happened in
> the logs below, but it's unclear why the blocks failed to be deleted, and why
> the server stayed up for a couple of minutes afterwards before receiving a
> delete tablet request and ultimately crashing. Following the crash, the server
> was able to start up successfully.
>
> {{I1130 00:00:07.565915 29721 tablet_service.cc:795] Processing DeleteTablet
> for tablet 1db7aa7e81474907ace3d493c24cdc94 with delete_type
> TABLET_DATA_DELETED (Partition dropped at 2018-11-30 00:00:07 PST) from
> \{username='kudu'} at 10.93.87.15:47194}}
> {{I1130 00:00:07.565929 29721 tablet_replica.cc:262] T
> 1db7aa7e81474907ace3d493c24cdc94 P 97235196a93b41c29954ed8534aa2ddc: stopping
> tablet replica}}
> {{I1130 00:00:07.565954 29721 maintenance_manager.cc:235] P
> 97235196a93b41c29954ed8534aa2ddc: Unregistered op
> CompactRowSetsOp(1db7aa7e81474907ace3d493c24cdc94)}}
> {{I1130 00:00:07.565997 29721 maintenance_manager.cc:235] P
> 97235196a93b41c29954ed8534aa2ddc: Unregistered op
> MinorDeltaCompactionOp(1db7aa7e81474907ace3d493c24cdc94)}}
> {{I1130 00:00:07.566010 29721 maintenance_manager.cc:235] P
> 97235196a93b41c29954ed8534aa2ddc: Unregistered op
> MajorDeltaCompactionOp(1db7aa7e81474907ace3d493c24cdc94)}}
> {{I1130 00:00:07.566020 29721 maintenance_manager.cc:235] P
> 97235196a93b41c29954ed8534aa2ddc: Unregistered op
> UndoDeltaBlockGCOp(1db7aa7e81474907ace3d493c24cdc94)}}
> {{I1130 00:00:07.566032 29721 maintenance_manager.cc:235] P
> 97235196a93b41c29954ed8534aa2ddc: Unregistered op
> FlushMRSOp(1db7aa7e81474907ace3d493c24cdc94)}}
> {{I1130 00:00:07.566040 29721 maintenance_manager.cc:235] P
> 97235196a93b41c29954ed8534aa2ddc: Unregistered op
> FlushDeltaMemStoresOp(1db7aa7e81474907ace3d493c24cdc94)}}
> {{I1130 00:00:07.566048 29721 maintenance_manager.cc:235] P
> 97235196a93b41c29954ed8534aa2ddc: Unregistered op
> LogGCOp(1db7aa7e81474907ace3d493c24cdc94)}}
> {{I1130 00:00:07.566056 29721 raft_consensus.cc:2012] T
> 1db7aa7e81474907ace3d493c24cdc94 P 97235196a93b41c29954ed8534aa2ddc [term 3
> FOLLOWER]: Raft consensus shutting down.}}
> {{I1130 00:00:07.566074 29721 raft_consensus.cc:2039] T
> 1db7aa7e81474907ace3d493c24cdc94 P 97235196a93b41c29954ed8534aa2ddc [term 3
> FOLLOWER]: Raft consensus is shut down!}}
> {{I1130 00:00:07.666061 29721 ts_tablet_manager.cc:1277] T
> 1db7aa7e81474907ace3d493c24cdc94 P 97235196a93b41c29954ed8534aa2ddc: Deleting
> tablet data with delete state TABLET_DATA_DELETED}}
> {{I1130 00:00:08.102607 29721 ts_tablet_manager.cc:1290] T
> 1db7aa7e81474907ace3d493c24cdc94 P 97235196a93b41c29954ed8534aa2ddc: tablet
> deleted with delete type TABLET_DATA_DELETED: last-logged OpId 3.1166195}}
> {{I1130 00:00:08.102629 29721 log.cc:981] T 1db7aa7e81474907ace3d493c24cdc94
> P 97235196a93b41c29954ed8534aa2ddc: Deleting WAL directory at
> /home/kudu/tablet/wal/wals/1db7aa7e81474907ace3d493c24cdc94}}
> {{I1130 00:00:08.103217 29721 ts_tablet_manager.cc:1310] T
> 1db7aa7e81474907ace3d493c24cdc94 P 97235196a93b41c29954ed8534aa2ddc: Deleting
> consensus metadata}}
> {{F1130 00:00:08.155643 29721 ts_tablet_manager.cc:848] Failed to delete
> tablet data for 1db7aa7e81474907ace3d493c24cdc94: Invalid argument: Unable to
> delete on-disk data from tablet 1db7aa7e81474907ace3d493c24cdc94: The
> metadata for tablet 1db7aa7e81474907ace3d493c24cdc94 still references
> orphaned blocks. Call DeleteTabletData() first}}
> {{I1130 00:02:09.460352 29725 tablet_service.cc:795] Processing DeleteTablet
> for tablet 1db7aa7e81474907ace3d493c24cdc94 with delete_type
> TABLET_DATA_DELETED (Partition dropped at 2018-11-30 00:00:07 PST) from
> \{username='kudu'} at 10.93.87.15:47194}}
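
The failure mode in the description quoted above can be illustrated with a toy model. It is a sketch under assumptions, not Kudu's code: the class, method, and exception names are hypothetical (only the error text and the name {{DeleteTabletData()}} come from the log), and it does not reproduce the actual fix in {{src/kudu/tablet/tablet_metadata.cc}}.

```python
class FatalError(RuntimeError):
    """Stands in for the fatal (F-level) log line that takes the server down."""


class ToyTabletMetadata:
    """Toy stand-in for tablet metadata; not Kudu's real class."""

    def __init__(self, tablet_id, blocks):
        self.tablet_id = tablet_id
        self.blocks = set(blocks)      # blocks still referenced by the tablet
        self.orphaned_blocks = set()   # blocks scheduled for deletion

    def delete_tablet_data(self, undeletable=()):
        """Orphan all blocks, then try to delete them from disk.

        Blocks named in `undeletable` simulate on-disk deletions that fail,
        so they stay listed as orphaned.
        """
        self.orphaned_blocks |= self.blocks
        self.blocks.clear()
        self.orphaned_blocks = {b for b in self.orphaned_blocks if b in undeletable}

    def delete_metadata(self):
        """Refuse to delete metadata while orphaned blocks are still listed."""
        if self.orphaned_blocks:
            raise ValueError(
                f"The metadata for tablet {self.tablet_id} still references "
                "orphaned blocks. Call DeleteTabletData() first"
            )
        return True


def delete_tablet(meta, undeletable=()):
    """Mimic the sequence in the log: delete data, then delete metadata."""
    meta.delete_tablet_data(undeletable)
    try:
        return meta.delete_metadata()
    except ValueError as err:
        # A failure to delete metadata is treated as fatal in the report.
        raise FatalError(
            f"Failed to delete tablet data for {meta.tablet_id}: {err}"
        ) from err
```

In this model, a single block that cannot be deleted is enough to turn an ordinary tablet deletion into a fatal error, which matches the sequence of log lines above.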