Andrew Wong has posted comments on this change. Change subject: KUDU-1952 Remove round-robin for block placement ......................................................................
Patch Set 37: (88 comments) http://gerrit.cloudera.org:8080/#/c/6636/35//COMMIT_MSG Commit Message: PS35, Line 35: disk-failure toleran > nit: typo Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/cfile/bloomfile-test-base.h File src/kudu/cfile/bloomfile-test-base.h: PS35, Line 74: {}, &sink)); > nit: would just '{}' be enough? Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/cfile/cfile-test.cc File src/kudu/cfile/cfile-test.cc: PS35, Line 218: > nit here and below: would '{}' fit in here? Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/fs/block_manager-stress-test.cc File src/kudu/fs/block_manager-stress-test.cc: PS35, Line 260: test_t > is 'this' necessary here? Done http://gerrit.cloudera.org:8080/#/c/6636/30/src/kudu/fs/block_manager-test.cc File src/kudu/fs/block_manager-test.cc: PS30, Line 192: LO > 'num_blocks_per_dir'? Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/fs/block_manager-test.cc File src/kudu/fs/block_manager-test.cc: PS35, Line 137: > nit: consider adding 'static' specifier -- as I see this method does not de Templating and the closure made this a bit tricky. For now I'm leaving it but will try getting it to work. PS35, Line 155: aths); > nit: why not to pass 'num_files' itself as an argument? Done PS35, Line 186: _block_o > nit: "after 5000 runs" Done PS35, Line 205: lockM > data directories? Done PS35, Line 226: num_paths_added_to++; > could you merge this method with the method above? a lot of the code seems The assertions are a bit different in the latter half of it, but I moved the block distribution to a separate function. PS35, Line 227: total_blocks_across_paths += n > pull this somewhere common? Done PS35, Line 228: _in_each_path[path > these files are "block_containers" right? More or less. Block containers consist of two files each. PS35, Line 230: } > why not 3 in this case? Done PS35, Line 237: BlockManagerTest<LogBlockManager > why not use this in the fbm too? This is necessary here as it doesn't close the blocks until CloseBlocks(). This will mean that each block will be placed in a new container, meaning 20 containers are created. For the FBM we only need to count files as the number of blocks. Used in new impl. PS35, Line 249: ); pa > nit: number Done PS35, Line 344: > nit: this is not quite the same assertion as before. any special reason to Before, we could assert that each directory had exactly seven files because it was predictable where blocks would be placed. Now, the only thing we can know for certain is the number of blocks (and therefore the number of files) within the path. Further testing of the placement distribution is done in data_dirs-test.cc PS35, Line 447: DataDirGroupPB test_group_pb; : // Check that the in-memory DataDirGroup did not change. : ASSERT_TRUE(this->bm_->dd_manager()->GetDataDirGroupP > this seems sketchy. are you reusing state across tests? what if I only run Good call, this is in SetUp and shouldn't be replaced here. Rather, I'll just add a check that they're the same. PS35, Line 678: shar > nit: add /* bool_arg_name */ Done, and elsewhere. PS35, Line 692: share > same Done http://gerrit.cloudera.org:8080/#/c/6636/36/src/kudu/fs/block_manager-test.cc File src/kudu/fs/block_manager-test.cc: PS36, Line 107: n num_dirs http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/fs/block_manager.h File src/kudu/fs/block_manager.h: PS35, Line 170: cks. > remove Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/fs/data_dirs-test.cc File src/kudu/fs/data_dirs-test.cc: Line 17: > nit: please include at least <string> and <vector> since they are used in t Done Line 18: #include <set> > nit: please include <gtest/gtest.h> and <gflags/gflags_declare.h> Done PS35, Line 55: test_t > nit: why not just ASSERT_OK() ? Done PS35, Line 56: test_b > ditto Done PS35, Line 69: vector<string> ret; > nit: I don't see it's changing during the tests, so consider adding 'const' Done PS35, Line 70: for (int i = 0; i < num_dirs; i++ > const? Done PS35, Line 72: ret.push_back(G > nit: is it really needed as a member? Would local variable be enough where Done PS35, Line 78: tions test_block_ > nit" non_existent_tablet_opts Done PS35, Line 80: > Here and everywhere in the tests consider adding s.ToString() in case if it Done PS35, Line 96: > Would it make sense to compare pb.uuids.size() and pb_.uuids.size() prior t Done PS35, Line 102: is > Does it make sense to check for invariants on 'dd' after this call? What a No side-effects, but an assumption that its' not full. Added assert_false(is_full) PS35, Line 107: > maybe this should be names CreateDataDirGroupForTablet? As mentioned on hipchat, changed GetDataDirGroupPBForTablet to GetDataDirGroupPB PS35, Line 109: ASSERT_EQ(orig_pb.uuids(i), pb.uuids(i)); > this doesn't return Status? It doesnt, calls to DeleteDataDirGroup are idempotent so it's not an error to call DeleteDataDirGroup on a tablet that doesn't exist PS35, Line 110: > maybe this should be named LoadDataDirGroupForTabletFromPB? Same as above. PS35, Line 112: > this is weird. you mean that it won't crate a duplicate data dir group for Right, it won't create a duplicate group if one exists. PS35, Line 115: > similarly here you mean: "Tried to load DataDirGroup for tablet changed to "Tried to load DataDirGroup for tablet but one is already registered" PS35, Line 120: DataDirGroupPB orig_pb; : ASSERT_OK(dd_manager_->CreateDataDirGroup(test_tablet_name_)); : ASSERT_TRUE(dd_manager_->GetDataDirGroupPB(test_tablet_name_, &orig_pb)); : dd_manager_->DeleteDataDirGroup(test_tablet_name_); > Is there anything specific for 'dd' in the course of there calls? If yes, Added assert for non-fullness PS35, Line 125: > similar comment as in the test above Done PS35, Line 129: ASSERT_TRUE(s.IsAlreadyPresent()) << s.ToString > If changing this flag on-the-fly, does it make sense to add the 'runtime' t Mm, I don't think so. It's only set here for the test. In production, I think it should be set once? PS35, Line 140: Status s = dd_manager_->GetNextDataDir(test_block_opts_, &dd); : ASSERT_FALSE(dd->is_full()); > may make this relative then? something like While I usually agree that constants help with readability, my concern here is that making it relative might mislead thinking that these values are related to each other. I.e. there's no telling that if kNumDirs = 100, that 33 data dirs per tablet and 200 tablets would be non-flaky. PS35, Line 143: ST > refer to the constant name here and at the end of the sentence Done PS35, Line 158: hange these configs > dont think you need this static cast since the numerator is already a doubl Done PS35, Line 159: t_data_dirs_per_tab > add more info Not sure more information to add. Mean is known statically because it's a function of kNumTablets and kNumDirs. For now, I've added a printout of kNumTablets and kNumdirs. Also the number of tablets in each directory are printed in the above loop. PS35, Line 161: : // Add 'kNumTablets' tablets, each with groups of size : // 'FLAGS_fs_target_dat > hm, yea, if the distribution were normal, then a z-score of 3 would be only Done PS35, Line 161: : // Add 'kNumTablets' tablets, each with groups of size : // 'FLAGS_fs_target_dat > think we're setting ourselves up for flakyness here, unfortunately it's har I suppose by design, these two tests are not meant to change, since the numbers used here have only been empirically verified with dist-tests. Not sure what value would be added by testing a larger number of directories; that seems like it's testing the integrity of "power of two choices" itself. These tests just ensure that the selection is working as expected wrt bias and distribution. PS35, Line 235: INFO) << > nit: since this is a test, it's possible to use something like Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: PS35, Line 435: is alr > here and below, I sort of feel like "exists" is not the right word, but rat Done PS35, Line 449: get_data_dirs_per_tablet, > A paranoid nit: what if FLAGS_fs_target_data_dirs_per_tablet is set to 2^31 switched to uint32_t PS35, Line 478: group_uuid_indices = &all_uuid_indices; > this is kinda redundant with the code Done PS35, Line 485: } > hum, can you at least change this to a CHECK? if we're missing coverage som Done CHECK makes sense, if we do happen to get here, we may end up placing data in a directory we shouldn't be. PS35, Line 504: "consider changing the fs_data_dirs_reserved_bytes configuration " : "parameter", "", ENOSPC); > nit: no need to wrap Done PS35, Line 477: AppendKeysFromMap(data_dir_by_uuid_idx_, &all_uuid_indices); : group_uuid_indices = &all_uuid_indices; : } else { : // Get the data dir group for the tablet. : DataDirGroup* group = FindOrNull(group_by_tablet_map_, opts.tablet_id); : if (group == nullptr) { : return Status::NotFound("Tried to get directory but no DataDirGroup " : "registered for tablet", opts.tablet_id); : } : group_uuid_indices = &group->uuid_indices(); : } : vector<int> random_indices(group_uuid_indices->size()); : iota(random_indices.begin(), random_indices.end(), 0); : shuffle(random_indices.begin(), random_indices.end(), default_random_engine(rng_.Next())); : : // Randomly select a member of the group that is not full. : for (int i : random_indices) { : uint16_t uuid_idx = (*group_uuid_indices)[i]; : DataDir* candidate = FindOrDie(data_dir_by_uuid_idx_, uuid_idx); : RETURN_NOT_OK(candidate->RefreshIsFull(DataDir::RefreshMode::EXPIRED_ONLY)); : if (!candidate->is_full()) { : *dir = candidate; : return Status::OK(); : } : } : return Status::IOError( : "All data directories are full. Please free some disk space or " : "consider changing the fs_data_dirs_reserved_bytes configuration " : "parameter", "", ENOSPC); : } : : void DataDirManager::DeleteDataDirGroup(const std::string& tablet_id) { : std::lock_guard<percpu_rwlock> lock(dir_group_lock_); : DataDirGroup* group = FindOrNull(group_by_tablet_map_, tablet_id); : if (group == nullptr) { : return; : } : // Remove the tablet_id from every data dir in its group. : > I guess I'm missing context, but what is this for? This is used by the block managers when they're about to place blocks. It returns a pointer to the dd_manager-owned DataDir within the group specified by 'opts'. The block manager uses this pointer to create a new block/container/etc within the dir. This is the function that was round robining before this patch. Contrast this with GetDirForGroupUnlocked, which is used by the dd_manager to select which dirs to add to a new DataDirGroup. PS35, Line 541: while (group_indices->size() < target_size && !candidate_indices.empty()) { : shuffle(candidate_indices.begin(), candidate_indices.end(), default_random_engine(rng_.Next())); : if (candidate_indices.size() == 1 || : FindOrDie(tablets_by_uuid_idx_map_, candidate_indices[0]).size() < : FindOrDie(tablets_by_uuid_idx_map_, candidate_indices[1]).size()) { : group_indices->push_back(candidate_indices[0]); : candidate_indices.erase(candidate_indices.begin()); : } else { : group_indices->push_back(candidate_indices[1]); : candidate_indices.erase(candidate_indices.begin() + 1); : } : } : return Status::OK(); : } : : DataDir* DataDirManager::FindDataDirByUuidIndex(uint16_t uuid_idx) const { : return FindPtrOrNull(data_dir_by_uuid_idx_, uuid_idx); : } : : bool DataDirManager::FindUuidIndexByDataDir(DataDir* dir, uint16_t* uuid_idx) const { : return FindCopy(uuid_idx_by_data_dir_, dir, uuid_idx); : } : : } // namespace fs : } // namespace kudu : : : : : : : : > wouldn't it be simpler and faster to get all the data dirs in one go? Agreed, that's much easier to follow. http://gerrit.cloudera.org:8080/#/c/6636/32/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: Line 69: explicit DataDirGroup(std::vector<uint16_t> uuid_indices) > warning: loop variable is copied but only used as const reference; consider Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: PS35, Line 59: > typo Done PS35, Line 65: > nit: I don't think you need to refer to uuids all over this class, likely d Hrm, what would the difference between a uuid_index and a dd_index be? Went with uuid_index because it's already an existing term in DataDirManager and in describing pathsets. PS35, Line 69: <uint16_t> uuid_indices) > nit: typedef? Done PS35, Line 69: > ..except here of course where you use "dd_uuid_to_idx_map Left as is but with the typedef. Within the context of this internal class (and within DataDirManager) uuid_idx is already established, adding to it might be confusing. PS35, Line 87: > nit: consider adding 'const' specified for this method. Done PS35, Line 221: > see my comments on the test about these method's names changed GetDataDirGroupPB() PS35, Line 225: // v > nit: would the caller ever be interested to know if any group were effectiv Right now, none of the call sites need to know. DeleteDataDirGroup() is called through the tablet metadata, and it can be called multiple times per tablet (e.g. tablet data is tombstoned first and then deleted), so it's not an error to try to delete a tablet_id that's been deleted. PS35, Line 254: uint16_t* uuid_idx) const; : : const std::vector<std::unique_ptr<DataDir>>& data_dirs() const { : return data_dirs_; : } > want to mention briefly what are the average case properties that you expec Done PS35, Line 291: > oh you do have a typedef. move it/fwd declare somewhere you can use it also Done PS35, Line 301: the dir group maps. : // A percpu_rwlock is used so threads attempting to read (e.g. to get the : // next data directory for a Flush()) do no > this is hard to parse exchanged some commas with parens http://gerrit.cloudera.org:8080/#/c/6636/36/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: PS36, Line 49: std::unordered_map<uint16_t, std::string> UuidByUuidIndexMap; : typedef s move to after fwd decl PS36, Line 94: the remove PS36, Line 224: irGroup, allowing for backwards compatability with data from older : // version of Kudu. : // update PS36, Line 237: zes t remove http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/fs/fs.proto File src/kudu/fs/fs.proto: PS35, Line 114: is > nit: I know that whether data is plural or singular a hot topic among gramm Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/fs/log_block_manager.h File src/kudu/fs/log_block_manager.h: PS35, Line 104: > remove Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/tablet/diskrowset.cc File src/kudu/tablet/diskrowset.cc: PS35, Line 93: const string& ta > const string& tablet_id ? Done Line 112: const string& tablet_id = rowset_metadata_->tablet_metadata()->tablet_id(); > ditto Done PS35, Line 127: const string& ta > ditto Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/tablet/metadata.proto File src/kudu/tablet/metadata.proto: PS35, Line 127: is > :) :''') PS35, Line 129: from a version of Kudu before 1 > mention a specific version Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/tablet/multi_column_writer.cc File src/kudu/tablet/multi_column_writer.cc: Line 52: const CreateBlockOptions block_opts({ tablet_id_ }); > nit: const? Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/tablet/multi_column_writer.h File src/kudu/tablet/multi_column_writer.h: PS35, Line 20: #include <map> > nit: move this after the std headers. Done PS35, Line 90: > nit: const? Done http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/tablet/tablet_bootstrap-test.cc File src/kudu/tablet/tablet_bootstrap-test.cc: PS35, Line 207: Status s = BootstrapTestTablet(-1, -1, &tablet, &boot_info); : ASSERT_TRUE(s.IsCorruption()) << "Expected corruption: " << s.ToString(); : ASSERT_STR_CONTAINS(s.ToString(), "TabletMetadata bootstrap state is TABLET_DATA_C > Not sure about this change. Ah, I see, so the in-memory DataDirGroup wouldn't be around. Makes sense, I'll leave the original expected error. http://gerrit.cloudera.org:8080/#/c/6636/35/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: PS35, Line 86: > can you PREPEND here to give better context? same below Done PS35, Line 199: } : : // Keep a copy of the old data dir group in case of flush failure. : DataDirGroupPB pb; : bool old_group_exists = fs_manager_->dd_manager()->GetDataDirGroupPB(tablet_id_, &pb); : : // Remove the tablet's data dir group tracked by the DataDirManager. : fs_manager_->dd_manager()->DeleteDataDirGroup(tablet_id_); : auto revert_group_cleanup = MakeScopedCleanup([&]() { : if (old_group_exists) { : fs_manager_->dd_manager()->LoadDataDirGroupFromPB(tablet_id_, pb); : } : }); : : // Flushing will sync the new tablet_data_state_ to disk and will now also : // delete all the data. : RETURN_NOT_OK(Flush()); : rev > Instead of doing this here, would it work to instead add a new method like Hrmm, the important thing would be to ensure that the in-memory DataDirGroup is updated before the blocks are downloaded. Moved the CreateDataDirGroup() to right after the call to DeleteTabletData in TabletCopyClient, since at that point, there will be no DataDirGroup for the tablet on-disk or in-memory (no need for a "Reallocate"). We still should be cleaning up the deletion if the Flush() fails. Line 388: // data is not TABLET_DATA_READY, group creation is pointless, as the > PREPEND Done PS35, Line 393: RETURN_NOT_OK > typo Done PS35, Line 398: > we try to avoid boolean parameters to change behavior. Even though it seems Done PS35, Line 629: > Is it worth to check for the return value of GetDataDirGroupPBForTablet()? Yeah, this addresses the issue todd mentioned. PS35, Line 629: > note that this will end up adding an empty data_dir_group field to 'pb' eve You're right, that's not intended. http://gerrit.cloudera.org:8080/#/c/6636/36/src/kudu/tserver/tablet_copy_client.cc File src/kudu/tserver/tablet_copy_client.cc: PS36, Line 265: _id_, superblock_->mutable_data_dir_group()); : spacing -- To view, visit http://gerrit.cloudera.org:8080/6636 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9828147f4fa5c4d7f6ed23441dca5a116b8cb11b Gerrit-PatchSet: 37 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Jean-Daniel Cryans <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
