Adar Dembo has posted comments on this change.

Change subject: KUDU-1943: Add BlockTransaction to Block Manager
......................................................................
Patch Set 19: (14 comments)

I think I reviewed everything, but I'm curious to see how this will evolve following my suggestions.

http://gerrit.cloudera.org:8080/#/c/7207/19//COMMIT_MSG
Commit Message:

PS19, Line 22: :xxxx
Nit: can just omit this.

http://gerrit.cloudera.org:8080/#/c/7207/19/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 245: typedef int CloseFlags;
Don't need this; just change "enum CloseFlag" to "enum CloseFlags".

Line 499: int64_t next_sync_offset() const { return next_sync_offset_.Load(); }
Don't need this; it's only ever used internally (by LogBlockContainer).

Line 582: // The offset of the container that next sync should start with.
Can we get by with a boolean? It seems like we could set it in RoundUpContainer when next_block_offset_ increases, and clear it when we sync.

PS19, Line 594: // Protects sync data operation to avoid unnecessary I/O.
             : Mutex lock_;
I don't understand what protection the lock offers. Furthermore, PosixRWFile::Sync() already keeps track of whether a sync is actually necessary.

Dan and I had a conversation about PosixRWFile::Sync(). The use of pending_sync_ as an optimization is unsafe: it can cause the "losing" thread in a race to return early under the assumption that the data has been made durable when it hasn't (yet). I filed KUDU-2102 for this; feel free to tackle that next if you like.

Line 906: auto cleanup = MakeScopedCleanup([&]() {
You should add a note here about mutating result_status on failure, like in LogWritableBlock::DoClose().

Line 939: int64_t offset = mode == ROUND_UP ? next_block_offset() : block_offset;
I don't understand this, and it's making me nervous. Let's simplify by exposing and using LogWritableBlock's offset. Hopefully we can do that without modifying WritableBlock itself (we may need to down_cast earlier and deal with LogWritableBlocks through and through).
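To make the KUDU-2102 concern concrete, here is a minimal single-process model of the pending_sync_ optimization. This is a hypothetical sketch, not Kudu's actual PosixRWFile; the names and members are illustrative only:

```cpp
#include <atomic>
#include <cassert>

// Illustrative model of a file handle that skips fsync() when no
// write is pending. All names here are hypothetical.
struct RWFileModel {
  std::atomic<bool> pending_sync_{false};
  int fsyncs = 0;        // stand-in for calls to the real fsync()
  bool durable = false;  // true only once an fsync has completed

  void Write() {
    pending_sync_.store(true);
    durable = false;
  }

  // The race described above: thread A clears pending_sync_ via the
  // exchange below but has not yet finished its fsync. A concurrent
  // caller B then sees pending_sync_ == false, returns immediately,
  // and wrongly believes the data is already durable.
  void Sync() {
    if (pending_sync_.exchange(false)) {
      // <-- window where a second caller returns early
      fsyncs++;
      durable = true;
    }
  }
};
```

Single-threaded, the optimization behaves as intended (one fsync per dirty interval); the bug only surfaces with concurrent Sync() callers, which is why it is easy to miss.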
PS19, Line 955: if (mode == ROUND_UP) {
             :   WARN_NOT_OK(TruncateDataToNextBlockOffset(),
             :               "could not truncate excess preallocated space");
             :
             :   if (full() && block_manager()->metrics()) {
             :     block_manager()->metrics()->full_containers->Increment();
             :   }
             : }
Here's another case where, even though FinishBlock() can be called concurrently, we would never want to call TruncateDataToNextBlockOffset() concurrently, and the ROUND_UP check prevents us from doing that. But it's tricky to follow 'mode' through the various call chains to prove this guarantee.

Line 1076: if (next_sync_offset() < cur_next_block_offset) {
These two accesses to next_sync_offset_ aren't atomic. Use StoreMax() instead. But since you can make this a bool, a simple CAS should do the trick; look at how PosixRWFile uses pending_sync_.

PS19, Line 1173: int64_t new_next_block_offset = KUDU_ALIGN_UP(
              :     block_offset + block_length,
              :     instance()->filesystem_block_size_bytes());
              : if (PREDICT_FALSE(new_next_block_offset < next_block_offset())) {
              :   LOG(WARNING) << Substitute(
              :       "Container $0 unexpectedly tried to lower its next block offset "
              :       "(from $1 to $2), ignoring",
              :       ToString(), next_block_offset(), new_next_block_offset);
              : } else {
              :   next_block_offset_.Store(new_next_block_offset);
              : }
I keep circling back to this as a potential synchronization issue. The comparison and subsequent assignment to next_block_offset_ are not atomic. That didn't matter prior to your change, because BlockCreated() was guaranteed to be called by just one thread at a time. It may not matter now either, but I'm finding it really tough (based on tracing through the code paths) to guarantee that, due to the proliferation of RoundMode in various deep methods. I think this would be safer if we used StoreMax() instead. And maybe total_bytes_ and total_blocks_ should be atomic too.
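For reference, the StoreMax() semantics being suggested (Kudu's AtomicInt64 provides this; the sketch below is an illustrative equivalent using std::atomic, not Kudu's actual implementation):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomically raise *target to 'value' if 'value' is larger, never
// lowering it, even with concurrent callers. This replaces the racy
// "compare, then Store()" pair flagged in BlockCreated().
void StoreMax(std::atomic<int64_t>* target, int64_t value) {
  int64_t cur = target->load();
  while (value > cur) {
    // On failure, compare_exchange_weak reloads 'cur' with the latest
    // stored value; retry only while 'value' still exceeds it.
    if (target->compare_exchange_weak(cur, value)) break;
  }
}
```

With this, two threads racing to record different new offsets always leave the larger one in place, with no window where the offset can go backwards.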
Line 1188: int64_t LogBlockContainer::fs_aligned_length(int64_t block_offset,
Surely we needn't duplicate this from LogBlock; if you need it at the container level (for operations where LogBlocks may not exist), can't we move it instead? In general, please try a little harder to avoid duplicating code, especially code with large comments like this.

PS19, Line 1464: if ((flags & SYNC) &&
              :     (state_ == CLEAN || state_ == DIRTY || state_ == FINALIZED)) {
              :   VLOG(3) << "Syncing block " << id();
              :
              :   // TODO(unknown): Sync just this block's dirty data.
              :   s = container_->SyncData();
              :   RETURN_NOT_OK(s);
              :
              :   // Append metadata only after data is synced to ensure metadata lands
              :   // on disk only after the corresponding data does so at any given point.
              :   s = AppendMetadata();
              :   RETURN_NOT_OK_PREPEND(s, "unable to flush block during close");
              :
              :   // TODO(unknown): Sync just this block's dirty metadata.
              :   s = container_->SyncMetadata();
              :   RETURN_NOT_OK(s);
              : } else if ((flags == FINISH) &&
              :            (state_ == CLEAN || state_ == DIRTY || state_ == FINALIZED)) {
              :   s = AppendMetadata();
              :   RETURN_NOT_OK_PREPEND(s, "unable to flush block during close");
              : }
Rewrite:

  if (flags & SYNC) {
    SyncData();
  }
  AppendMetadata();
  if (flags & SYNC) {
    SyncMetadata();
  }

I don't think you need to check state_ at all. There are four possible states (CLEAN, DIRTY, FINALIZED, and CLOSED), and we know we're not CLOSED because of the check on L1427.

Line 1905: // Close all blocks and sync the blocks belonging to the same
I think this control flow (and the corresponding surgery to DoClose) is more complex than it needs to be. A couple of things made me think this way:

1. It's difficult to keep track of RoundMode from BlockCreated up through the various call sites (including DoClose).
2. DoClose(NONE), as far as I can tell, just marks the block as CLOSED and updates some metrics. It calls Finalize too, but that's already been done on L1901, so it's redundant.
I keep coming back to DoClose() as being too complicated, and I think this stems from how you've plumbed the handling of closing groups of blocks. What if we inverted this, so that closing one block is actually just closing a singleton vector of blocks? Then we could remove DoClose() altogether and have Close() chain to LogBlockManager::DoCloseBlocks({ this }, SYNC). Abort() would call DoCloseBlocks({ this }, NO_SYNC). DoCloseBlocks() would:

1. Make a per-container map of blocks.
2. If SYNC, for each container, sync the data.
3. For each container, for each block of that container, append the metadata.
4. If SYNC, for each container, sync the metadata.
5. If SYNC, sync dirty data dirs if necessary (see SyncContainer).
6. For each block, actually "finish" the block (i.e. update metrics, etc.).

You can see how, for the single-block case, this is effectively the same thing as what DoClose() does today. Then CloseBlocks() would call DoCloseBlocks({ ... }, SYNC). I think this will also help reduce the duplication in SyncBlocks() and FinishBlock().

PS19, Line 1910: lwr
Nit: lwb

--
To view, visit http://gerrit.cloudera.org:8080/7207
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I01f3ed830cc1f86d7bb8a66b2f780914a88884bd
Gerrit-PatchSet: 19
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Hao Hao <hao....@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <danburk...@apache.org>
Gerrit-Reviewer: Hao Hao <hao....@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes