Adar Dembo has posted comments on this change.

Change subject: KUDU-1943: Add BlockTransaction to Block Manager
......................................................................


Patch Set 19:

(14 comments)

I think I reviewed everything, but I'm curious to see how this will evolve 
following my suggestions.

http://gerrit.cloudera.org:8080/#/c/7207/19//COMMIT_MSG
Commit Message:

PS19, Line 22: :xxxx
Nit: can just omit this.


http://gerrit.cloudera.org:8080/#/c/7207/19/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 245:   typedef int CloseFlags;
Don't need this; just change "enum CloseFlag" to "enum CloseFlags".
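
A sketch of what I mean (the flag values are assumed for illustration):

  // Before: separate flag enum plus an int typedef.
  enum CloseFlag {
    FINISH = 1 << 0,
    SYNC = 1 << 1,
  };
  typedef int CloseFlags;

  // After: just pluralize the enum and drop the typedef.
  enum CloseFlags {
    FINISH = 1 << 0,
    SYNC = 1 << 1,
  };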


Line 499:   int64_t next_sync_offset() const { return next_sync_offset_.Load(); }
Don't need this; it's only ever used internally (by LogBlockContainer).


Line 582:   // The offset of the container that next sync should start with.
Can we get by with a boolean? Seems like we could set it in RoundUpContainer 
when the next_block_offset_ increases and clear it when we sync.
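
E.g., a sketch using AtomicBool from kudu/util/atomic.h (data_dirty_ is a hypothetical name):

  // Set in RoundUpContainer() when next_block_offset_ grows (i.e. there is
  // unsynced data); cleared once a sync completes.
  AtomicBool data_dirty_;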


PS19, Line 594:   // Protects sync data operation to avoid unnecessary I/O.
              :   Mutex lock_;
I don't understand what protection the lock offers. Furthermore, 
PosixRWFile::Sync() already keeps track of whether a sync is actually necessary.

Dan and I had a conversation about PosixRWFile::Sync(). The use of pending_sync_ 
as an optimization is unsafe: it can cause the "losing" thread in a race to 
return early under the assumption that the data has been made durable when it 
hasn't (yet) been. For example, if thread A wins the compare-and-swap and begins 
its fsync, thread B's swap fails and B returns success immediately, even though 
A's fsync is still in flight. I filed KUDU-2102 for this; feel free to tackle 
that next if you like.


Line 906:   auto cleanup = MakeScopedCleanup([&]() {
You should add a note here about mutating result_status on failure, like in 
LogWritableBlock::DoClose().


Line 939:   int64_t offset = mode == ROUND_UP ? next_block_offset() : block_offset;
I don't understand this, and it's making me nervous.

Let's simplify by exposing and using LogWritableBlock's offset. Hopefully we 
can do it without modifying WritableBlock itself (may need to down_cast earlier 
and deal with LogWritableBlocks through and through).
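
For example (a sketch; the placement of the down_cast and the block_offset() accessor are assumptions):

  // Treat the block as a LogWritableBlock from the start, so the container
  // can ask the block for its own offset rather than recomputing it.
  LogWritableBlock* lwb = down_cast<LogWritableBlock*>(block);
  int64_t offset = lwb->block_offset();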


PS19, Line 955:   if (mode == ROUND_UP) {
              :     WARN_NOT_OK(TruncateDataToNextBlockOffset(),
              :                 "could not truncate excess preallocated space");
              : 
              :     if (full() && block_manager()->metrics()) {
              :       block_manager()->metrics()->full_containers->Increment();
              :     }
              :   }
Here's another case where, even though FinishBlock() can be called 
concurrently, we would never want to call TruncateDataToNextBlockOffset() 
concurrently, and the ROUND_UP check prevents us from doing that. But it's 
tricky to follow 'mode' through the various call chains to prove this guarantee.


Line 1076:     if (next_sync_offset() < cur_next_block_offset) {
These two accesses to next_sync_offset_ aren't atomic. Use StoreMax() instead.

But since you can make this a bool, a simple CAS should do the trick. Look at 
how PosixRWFile uses pending_sync_.
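
E.g., a minimal sketch, assuming the hypothetical AtomicBool (call it data_dirty_) from my comment on L582:

  Status LogBlockContainer::SyncData() {
    // Only the thread that flips the dirty bit pays for the sync.
    if (data_dirty_.CompareAndSet(/*expected=*/true, /*new=*/false)) {
      return data_file_->Sync();
    }
    // Note: per KUDU-2102, this early return has the same durability race
    // as PosixRWFile's pending_sync_.
    return Status::OK();
  }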


PS19, Line 1173:   int64_t new_next_block_offset = KUDU_ALIGN_UP(
               :       block_offset + block_length,
               :       instance()->filesystem_block_size_bytes());
               :   if (PREDICT_FALSE(new_next_block_offset < next_block_offset())) {
               :     LOG(WARNING) << Substitute(
               :         "Container $0 unexpectedly tried to lower its next block offset "
               :         "(from $1 to $2), ignoring",
               :         ToString(), next_block_offset(), new_next_block_offset);
               :   } else {
               :     next_block_offset_.Store(new_next_block_offset);
               :   }
I keep circling back to this as a potential synchronization issue. The 
comparison and the subsequent assignment to next_block_offset_ are not atomic 
as a pair. They didn't need to be prior to your change, because BlockCreated() 
was guaranteed to be called by just one thread at a time. That may still hold, 
but I'm finding it really tough (based on tracing through the code paths) to 
guarantee, due to the proliferation of RoundMode in various deep methods.

I think this would be safer if we used StoreMax() instead. And maybe 
total_bytes_ and total_blocks_ should be atomic too.
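
I.e., a minimal sketch (assuming next_block_offset_ remains an AtomicInt<int64_t> from kudu/util/atomic.h):

  // StoreMax() performs the compare-and-update atomically, so a concurrent
  // caller can never lower the offset.
  next_block_offset_.StoreMax(new_next_block_offset);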


Line 1188: int64_t LogBlockContainer::fs_aligned_length(int64_t block_offset,
Surely we needn't duplicate this from LogBlock; if you need it at the container 
level (for operations where LogBlocks may not exist), can't we move it instead?

In general please try a little harder to avoid duplicating code, especially 
code with large comments like this.


PS19, Line 1464:     if ((flags & SYNC) &&
               :         (state_ == CLEAN || state_ == DIRTY || state_ == FINALIZED)) {
               :       VLOG(3) << "Syncing block " << id();
               :
               :       // TODO(unknown): Sync just this block's dirty data.
               :       s = container_->SyncData();
               :       RETURN_NOT_OK(s);
               :
               :       // Append metadata only after data is synced to ensure metadata lands
               :       // on disk only after the corresponding data does so at any given point.
               :       s = AppendMetadata();
               :       RETURN_NOT_OK_PREPEND(s, "unable to flush block during close");
               :
               :       // TODO(unknown): Sync just this block's dirty metadata.
               :       s = container_->SyncMetadata();
               :       RETURN_NOT_OK(s);
               :     } else if ((flags == FINISH) &&
               :                (state_ == CLEAN || state_ == DIRTY || state_ == FINALIZED)) {
               :       s = AppendMetadata();
               :       RETURN_NOT_OK_PREPEND(s, "unable to flush block during close");
               :     }
               :   }
Rewrite:

  if (flags & SYNC) {
    RETURN_NOT_OK(container_->SyncData());
  }
  RETURN_NOT_OK_PREPEND(AppendMetadata(),
                        "unable to flush block during close");
  if (flags & SYNC) {
    RETURN_NOT_OK(container_->SyncMetadata());
  }

I don't think you need to check state_ at all. There are four possible states 
(CLEAN, DIRTY, FINALIZED, and CLOSED) and we know we're not CLOSED because of 
the check on L1427.


Line 1905:   // Close all blocks and sync the blocks belonging to the same
I think this control flow (and the corresponding surgery to DoClose()) is more 
complex than it needs to be. A couple of things made me think this way:
1. It's difficult to keep track of RoundMode from BlockCreated() up through 
the various call-sites (including DoClose()).
2. DoClose(NONE), as far as I can tell, just marks the block as CLOSED and 
updates some metrics. It calls Finalize() too, but that's already been done on 
L1901, so it's redundant.

I keep coming back to DoClose() as being too complicated, and I think this 
stems from how you've plumbed the handling of closing groups of blocks. What 
if we inverted this, so that closing one block is actually just closing a 
singleton vector of blocks? Then we could remove DoClose() altogether and have 
Close() chain to LogBlockManager::DoCloseBlocks({ this }, SYNC). Abort() would 
call DoCloseBlocks({ this }, NO_SYNC).

DoCloseBlocks() would do:
1. Make a per-container map of blocks.
2. If SYNC, for each container, sync the data.
3. For each container, for each block of that container, append the metadata.
4. If SYNC, for each container, sync the metadata.
5. If SYNC, sync dirty data dirs if necessary (see SyncContainer).
6. For each block, actually "finish" the block (i.e. update metrics, etc.)

You can see how, for the single-block case, this is effectively the same thing 
as what DoClose() does today; there's a rough sketch at the end of this comment.

Then CloseBlocks() would call DoCloseBlocks({ ... }, SYNC).

I think this will also help reduce the duplication in SyncBlocks() and 
FinishBlock().
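
To make that concrete, here's a rough sketch (MarkClosed() and the container() accessor are made-up names; step 5 is left as a comment since it mirrors SyncContainer()):

  Status LogBlockManager::DoCloseBlocks(const vector<LogWritableBlock*>& blocks,
                                        SyncMode mode) {
    // 1. Make a per-container map of blocks.
    unordered_map<LogBlockContainer*, vector<LogWritableBlock*>> by_container;
    for (LogWritableBlock* b : blocks) {
      by_container[b->container()].push_back(b);
    }

    // 2. If SYNC, sync each container's data before any metadata is appended.
    if (mode == SYNC) {
      for (const auto& e : by_container) {
        RETURN_NOT_OK(e.first->SyncData());
      }
    }

    // 3. Append each block's metadata record.
    for (const auto& e : by_container) {
      for (LogWritableBlock* b : e.second) {
        RETURN_NOT_OK(b->AppendMetadata());
      }
    }

    // 4. If SYNC, sync each container's metadata.
    if (mode == SYNC) {
      for (const auto& e : by_container) {
        RETURN_NOT_OK(e.first->SyncMetadata());
      }
      // 5. Sync dirty data dirs if necessary (see SyncContainer()).
    }

    // 6. "Finish" each block: mark it CLOSED, update metrics, etc.
    for (LogWritableBlock* b : blocks) {
      b->MarkClosed();
    }
    return Status::OK();
  }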


PS19, Line 1910: lwr
Nit: lwb


-- 
To view, visit http://gerrit.cloudera.org:8080/7207
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I01f3ed830cc1f86d7bb8a66b2f780914a88884bd
Gerrit-PatchSet: 19
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Hao Hao <hao....@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <danburk...@apache.org>
Gerrit-Reviewer: Hao Hao <hao....@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes
