Adar Dembo has posted comments on this change. (
http://gerrit.cloudera.org:8080/12075 )
Change subject: KUDU-2636: LBM supports deleting dead and full containers
......................................................................
Patch Set 6:
(11 comments)
The dense_node-itest failure is interesting. Here's some useful bits from the
log:
I1219 04:37:17.664212 817 log_block_manager.cc:2002] Deleted 1 dead
containers (1256 metadata bytes)
...
I1219 04:37:19.328910 817 log_block_manager.cc:2002] Deleted 2 dead
containers (2513 metadata bytes)
F1219 04:37:19.336031 817 log_block_manager.cc:1969] Check failed: to_delete
*** Check failure stack trace: ***
@ 0x7fc8c8bd162d google::LogMessage::Fail() at ??:0
@ 0x7fc8c8bd364c google::LogMessage::SendToLog() at ??:0
@ 0x7fc8c8bd1189 google::LogMessage::Flush() at ??:0
@ 0x7fc8c8bd3fdf google::LogMessageFatal::~LogMessageFatal() at ??:0
@ 0x7fc8c68bf7eb
kudu::fs::LogBlockManager::RemoveFullContainersUnlocked() at ??:0
@ 0x7fc8c68bf8d3 kudu::fs::LogBlockManager::RemoveDeadContainers() at
??:0
@ 0x7fc8c938065e kudu::ThreadPool::DispatchThread() at ??:0
@ 0x7fc8c937626e kudu::Thread::SuperviseThread() at ??:0
@ 0x7fc8c6d95184 start_thread at ??:0
@ 0x7fc8c81dbffd clone at ??:0
FYI, the test runs with only one data directory. But it seems like it's trying
to delete the same set of containers maybe? Some more logging in
log_block_manager.cc may help. You may also be able to reproduce the failure
locally.
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.h@276
PS6, Line 276: // Removes dead containers from disk.
: //
: // Must call RemoveFullContainersUnlocked() first.
This is a good place to expand about the interaction with FileCache. You could
say:
// Removes dead containers from disk.
//
// Must first call RemoveFullContainersUnlocked() to remove the containers
from
// memory. This guarantees that the containers' FileCache descriptors are
expired.
// Thus, the FileCache deletions done here will delete the files straight
away,
// which allows the syncing of the data dir to actually have some effect.
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1237
PS6, Line 1237: // Due to the shared ownership of
LogBlockDeletionTransaction, the destructor is
: // responsible for destroying all registered blocks. This
includes:
: // 1. Punching holes in deleted blocks, and
: // 2. Deleting dead containers outright.
The formatting issue persists. Take a look at the diff in gerrit; you'll see
that each line is offset from the next with an extra blank line.
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1256
PS6, Line 1256: std::unordered_map<internal::LogBlockContainer*,
: std::vector<BlockInterval>> deleted_interval_map_;
Nit: while you're here, could you drop the unnecessary std:: prefixes here and
on L1264?
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1260
PS6, Line 1260: unordered_map<DataDir*, set<string>> dead_containers_by_dir_;
The value would probably be more efficient as an unordered_set rather than a
set.
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1631
PS6, Line 1631: // Access to container_ must be before calling
log_block_.reset().
: //
: // Because log_block_.reset() may cause the LogBlock() to
lose its last reference
: // and be destroyed. This may cause a registered
LogBlockDeletionTransaction to lose
: // its last reference and be destroyed, which will trigger
the asynchronous code that
: // will eventually destroy the container which is dead.
Thus, the ordering of operations
: // here is very important, otherwise it is unsafe.
Nit: reword a bit:
// We may not safely access container_ after calling log_block_.reset().
//
// log_block_.reset() may cause the LogBlock() to lose its last reference and
be destroyed.
// This may cause a registered LogBlockDeletionTransaction to lose its last
reference and
// be destroyed, which will trigger the asynchronous code in its destructor
that will
// eventually destroy the (now dead) container.
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1965
PS6, Line 1965: // The Descriptor() in the FileCache() will be destroyed
: // along with LogBlockContainer's destructor.
Nit: I would drop this comment. It's not really necessary to improve the
understanding of RemoveFullContainersUnlocked; it's more useful for
RemoveDeadContainersFromDisk.
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1980
PS6, Line 1980: DataDir* dir, const std::vector<std::string>& names) {
Don't need std:: prefixes.
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1982
PS6, Line 1982: for (const auto& name : names) {
There are some extra line formatting issues here.
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@2008
PS6, Line 2008: // Remove from the memory.
Nit: "Remove the containers from memory."
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@2014
PS6, Line 2014: // Remove from the disk.
Nit: "Remove the containers from disk."
(the problem with 'from the disk' is that it suggests that there's just one
disk).
http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@2580
PS6, Line 2580: // After the deletions, the data directory is sync'ed to
reduce the chance
: // of a data file existing without its corresponding metadata
file (or vice
: // versa) in the event of a crash. The block manager would
treat such a case
: // as corruption and require manual intervention.
: //
: // TODO(adar) the above is not fool-proof; a crash could
manifest in between
: // any pair of deletions. That said, the odds of it happening
are incredibly
: // rare, and manual resolution isn't hard (just delete the
existing file).
Move this into the body of RemoveDeadContainersFromDisk, just above the call to
SyncDir, and remove the "After the deletions, ..." part of the comment.
--
To view, visit http://gerrit.cloudera.org:8080/12075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7d3672ae3c3dd9acd489120d653c44a641537f10
Gerrit-Change-Number: 12075
Gerrit-PatchSet: 6
Gerrit-Owner: helifu <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Hao Hao <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: helifu <[email protected]>
Gerrit-Comment-Date: Wed, 19 Dec 2018 05:10:28 +0000
Gerrit-HasComments: Yes