Adar Dembo has posted comments on this change. Change subject: [c++-client]: cache non-covering ranges in meta cache ......................................................................
Patch Set 5: (17 comments) http://gerrit.cloudera.org:8080/#/c/3581/5/src/kudu/client/client-test.cc File src/kudu/client/client-test.cc: PS5, Line 1223: int original_table_locations_ttl_ms = FLAGS_table_locations_ttl_ms; : FLAGS_table_locations_ttl_ms = new_ttl; : auto cleanup = MakeScopedCleanup([&] () { : FLAGS_table_locations_ttl_ms = original_table_locations_ttl_ms; : }); You can instantiate a FlagSaver instead. Line 1238: ASSERT_FALSE(entry.stale()); The TTL is 25ms, so it's conceivable that on a bogged down test environment the entry may become stale between the time that it is inserted and the time that stale() is called. Could you loop this test 1000 times with TSAN and see if it's indeed flaky? Alternatively we could raise the TTL but that just means a longer sleep below. Or perhaps this (and L1246-1247) should be done in a loop and accepted provided one iteration passes. http://gerrit.cloudera.org:8080/#/c/3581/5/src/kudu/client/client.h File src/kudu/client/client.h: Line 300: FRIEND_TEST(ClientTest, TestMetaCacheExpiry); Nit: sort alphabetically. http://gerrit.cloudera.org:8080/#/c/3581/5/src/kudu/client/meta_cache.cc File src/kudu/client/meta_cache.cc: PS5, Line 753: if (new_status.IsServiceUnavailable()) { : // One or more of the tablets is not running; retry after a backoff period. : mutable_retrier()->DelayedRetry(this, new_status); : ignore_result(delete_me.release()); : return; : } To be clear, this has nothing to do with non-covering range partition support, right? This should have been added as part of 30d76a3? As an analogue of 1a97c42? Line 790: if (rpc.resp().tablet_locations().empty()) { rpc.resp().tablet_locations() is called four times, maybe pull it out into a cref local variable? Line 803: string last_upper_bound = If possible, could you add an ASCII art diagram to this section that will visually map tablets (or non-covered ranges) to sections of code that handle them? For example: // a b c b c b // +------+ +-----+ +-----+ // | | | | | | // +------+ +-----+ +-----+ And then in the code comments, add references to a), b), c) etc. as needed. PS5, Line 817: tablet.partition().partition_key_start() Can you pull this (and end) into local variables? PS5, Line 848: FindOrDie(tablets_by_key, tablet.partition().partition_key_start()); This assumes that a tablet can't change in partition size (i.e. that should mean a new tablet), right? Since the partition information comes off the wire, it might be better to handle that weird state gracefully. Alternatively, if tablets_by_id_ values were MetaCacheEntries instead of RemoteTablets, you'd have the TTL right there. But then you'd need to share MetaCacheEntries amongst the two maps (using shared ownership); not sure if that's net less complexity or not. PS5, Line 850: entry.upper_bound_partition_key() == tablet.partition().partition_key_end()); This part also depends on well-formed information off the wire. PS5, Line 865: MetaCacheEntry entry(expiration_time, remote); : VLOG(3) << "Caching '" << rpc.table_name() << "' entry " << entry.DebugString(rpc.table()); : : InsertOrDie(&tablets_by_id_, tablet_id, remote); : InsertOrDie(&tablets_by_key, partition.partition_key_start(), std::move(entry)); Maybe encapsulate this (and the non-covered range variant) in a helper method? PS5, Line 874: // There is a non-covered range between the last tablet upper bound and : // the end of the partition key space. What if the total number of tablets is a multiple of MAX_RETURNED_TABLE_LOCATIONS and we just did a lookup for the last batch of tablets? There may be a non-covered range at the end, but we won't insert an entry for it here. Is that a problem? Line 885: *cache_entry = *DCHECK_NOTNULL(FindFloorOrNull(tablets_by_key, rpc.partition_key())); Why FindFloorOrNull if you DCHECK_NOTNULL() right after? Seems unsafe. http://gerrit.cloudera.org:8080/#/c/3581/1/src/kudu/client/meta_cache.h File src/kudu/client/meta_cache.h: Line 283: const scoped_refptr<RemoteTablet>& tablet() const { > That wouldn't type check. I'm not seeing a warning with clang or gcc 4.9, Sure, or DCHECK(tablet_.get()) http://gerrit.cloudera.org:8080/#/c/3581/5/src/kudu/client/table-internal.cc File src/kudu/client/table-internal.cc: PS5, Line 133: if (s.ok()) { : break; : } else { : LOG(WARNING) << "Error getting table locations: " << s.ToString() << ", retrying."; : } Likewise, this isn't logically part of this change, right? http://gerrit.cloudera.org:8080/#/c/3581/5/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: Line 150: DEFINE_int32(table_locations_ttl_ms, 36000000, How about expressing this as: 60 * 60 * 1000 // 1 hour http://gerrit.cloudera.org:8080/#/c/3581/5/src/kudu/tools/ksck_remote.cc File src/kudu/tools/ksck_remote.cc: Line 283: SleepFor(MonoDelta::FromSeconds(retries)); That seems like a rather high sleep value. Does it cause the ksck tests to noticeably slow down? Maybe use MonoDelta::FromMilliseconds(retries * 100)? Line 320: new KsckTabletReplica(replica.ts_info().permanent_uuid(), is_leader, is_follower))); Nit: wasn't the old indentation correct? -- To view, visit http://gerrit.cloudera.org:8080/3581 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I05bcb3fe05d51d7c455e1d68bd2baa6f3c2b9d21 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: Jean-Daniel Cryans <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
