Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12075 )

Change subject: KUDU-2636: LBM supports deleting dead and full containers
......................................................................


Patch Set 6:

(11 comments)

The dense_node-itest failure is interesting. Here's some useful bits from the 
log:

  I1219 04:37:17.664212   817 log_block_manager.cc:2002] Deleted 1 dead 
containers (1256 metadata bytes)
  ...
  I1219 04:37:19.328910   817 log_block_manager.cc:2002] Deleted 2 dead 
containers (2513 metadata bytes)
  F1219 04:37:19.336031   817 log_block_manager.cc:1969] Check failed: to_delete
  *** Check failure stack trace: ***
    @     0x7fc8c8bd162d  google::LogMessage::Fail() at ??:0
    @     0x7fc8c8bd364c  google::LogMessage::SendToLog() at ??:0
    @     0x7fc8c8bd1189  google::LogMessage::Flush() at ??:0
    @     0x7fc8c8bd3fdf  google::LogMessageFatal::~LogMessageFatal() at ??:0
    @     0x7fc8c68bf7eb  
kudu::fs::LogBlockManager::RemoveFullContainersUnlocked() at ??:0
    @     0x7fc8c68bf8d3  kudu::fs::LogBlockManager::RemoveDeadContainers() at 
??:0
    @     0x7fc8c938065e  kudu::ThreadPool::DispatchThread() at ??:0
    @     0x7fc8c937626e  kudu::Thread::SuperviseThread() at ??:0
    @     0x7fc8c6d95184  start_thread at ??:0
    @     0x7fc8c81dbffd  clone at ??:0

FYI, the test runs with only one data directory. But it seems like it's trying 
to delete the same set of containers maybe? Some more logging in 
log_block_manager.cc may help. You may also be able to reproduce the failure 
locally.

http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.h@276
PS6, Line 276:   // Removes dead containers from disk.
             :   //
             :   // Must call RemoveFullContainersUnlocked() first.
This is a good place to expand about the interaction with FileCache. You could 
say:

  // Removes dead containers from disk.
  //
  // Must first call RemoveFullContainersUnlocked() to remove the containers 
from
  // memory. This guarantees that the containers' FileCache descriptors are 
expired.
  // Thus, the FileCache deletions done here will delete the files straight 
away,
  // which allows the syncing of the data dir to actually have some effect.


http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1237
PS6, Line 1237:   // Due to the shared ownership of 
LogBlockDeletionTransaction, the destructor is
              :   // responsible for destroying all registered blocks. This 
includes:
              :   // 1. Punching holes in deleted blocks, and
              :   // 2. Deleting dead containers outright.
The formatting issue persists. Take a look at the diff in gerrit; you'll see 
that each line is offset from the next with an extra blank line.


http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1256
PS6, Line 1256:   std::unordered_map<internal::LogBlockContainer*,
              :       std::vector<BlockInterval>> deleted_interval_map_;
Nit: while you're here, could you drop the unnecessary std:: prefixes here and 
on L1264?


http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1260
PS6, Line 1260:   unordered_map<DataDir*, set<string>> dead_containers_by_dir_;
The value would probably be more efficient as an unordered_set rather than a 
set.


http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1631
PS6, Line 1631:     // Access to container_ must be before calling 
log_block_.reset().
              :     //
              :     // Because log_block_.reset() may cause the LogBlock() to 
lose its last reference
              :     // and be destroyed. This may cause a registered 
LogBlockDeletionTransaction to lose
              :     // its last reference and be destroyed, which will trigger 
the asynchronous code that
              :     // will eventually destroy the container which is dead. 
Thus, the ordering of operations
              :     // here is very important, otherwise it is unsafe.
Nit: reword a bit:

  // We may not safely access container_ after calling log_block_.reset().
  //
  // log_block_.reset() may cause the LogBlock() to lose its last reference and 
be destroyed.
  // This may cause a registered LogBlockDeletionTransaction to lose its last 
reference and
  // be destroyed, which will trigger the asynchronous code in its destructor 
that will
  // eventually destroy the (now dead) container.


http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1965
PS6, Line 1965:     // The Descriptor() in the FileCache() will be destroyed
              :     // along with LogBlockContainer's destructor.
Nit: I would drop this comment. It's not really necessary to improve the 
understanding of RemoveFullContainersUnlocked; it's more useful for 
RemoveDeadContainersFromDisk.


http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1980
PS6, Line 1980:     DataDir* dir, const std::vector<std::string>& names) {
Don't need std:: prefixes.


http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@1982
PS6, Line 1982:   for (const auto& name : names) {
There are some extra line formatting issues here.


http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@2008
PS6, Line 2008:   // Remove from the memory.
Nit: "Remove the containers from memory."


http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@2014
PS6, Line 2014:   // Remove from the disk.
Nit: "Remove the containers from disk."

(the problem with 'from the disk' is that it suggests that there's just one 
disk).


http://gerrit.cloudera.org:8080/#/c/12075/6/src/kudu/fs/log_block_manager.cc@2580
PS6, Line 2580:   // After the deletions, the data directory is sync'ed to 
reduce the chance
              :   // of a data file existing without its corresponding metadata 
file (or vice
              :   // versa) in the event of a crash. The block manager would 
treat such a case
              :   // as corruption and require manual intervention.
              :   //
              :   // TODO(adar) the above is not fool-proof; a crash could 
manifest in between
              :   // any pair of deletions. That said, the odds of it happening 
are incredibly
              :   // rare, and manual resolution isn't hard (just delete the 
existing file).
Move this into the body of RemoveDeadContainersFromDisk, just above the call to 
SyncDir, and remove the "After the deletions, ..." part of the comment.



--
To view, visit http://gerrit.cloudera.org:8080/12075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7d3672ae3c3dd9acd489120d653c44a641537f10
Gerrit-Change-Number: 12075
Gerrit-PatchSet: 6
Gerrit-Owner: helifu <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Hao Hao <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: helifu <[email protected]>
Gerrit-Comment-Date: Wed, 19 Dec 2018 05:10:28 +0000
Gerrit-HasComments: Yes

Reply via email to