Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12037 )
Change subject: IMPALA-7928: Consistent remote read scheduling ...................................................................... Patch Set 5: (42 comments) http://gerrit.cloudera.org:8080/#/c/12037/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12037/5//COMMIT_MSG@15 PS5, Line 15: simluated nit: typo http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/experiments/hash-ring-util.cc File be/src/experiments/hash-ring-util.cc: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/experiments/hash-ring-util.cc@39 PS5, Line 39: hashspace nit: hashspace or hash space (line below) http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/experiments/hash-ring-util.cc@44 PS5, Line 44: HashRingUtil Can you think of a more descriptive name? HashRingConsistencyCheck or similar? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/backend-config.h File be/src/scheduling/backend-config.h: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/backend-config.h@33 PS5, Line 33: /// hostnames to IP addresses. Maybe mention the hash ring in the class comment, too? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/backend-config.h@83 PS5, Line 83: HashRing backend_ip_hashring_; nit: hash_ring? Or Hashring? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/backend-config.cc File be/src/scheduling/backend-config.cc: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/backend-config.cc@23 PS5, Line 23: static const uint32_t NUM_HASHRING_REPLICAS = 25; How did you pick 25? Does this number have to have any special properties, e.g. not be a power of 2? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring-test.cc File be/src/scheduling/hash-ring-test.cc: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring-test.cc@42 PS5, Line 42: void VerifyTotalAllocation(const vector<TNetworkAddress>& addresses, I think these methods could use a comment, some of them weren't clear to me from their names alone. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.h File be/src/scheduling/hash-ring.h: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.h@31 PS5, Line 31: TNetworkAddress For the purpose of scheduling we use IP addresses because whether a read is local or remote does not depend on the port. Should we use the same type here, too? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.h@45 PS5, Line 45: /// node is remapped. This allows for bounded disruption. Should we quantify the bound? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.h@64 PS5, Line 64: for (auto it = hash_ring.node_hashmap_.begin(); nit: you could use a range-based for loop, here and elsewhere. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.h@76 PS5, Line 76: HashRing(HashRing&&) = delete; Deleting a ctor seems uncommon enough to warrant a comment. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.h@97 PS5, Line 97: /// this TNetworkAddress. This is useful for examining how the hash space is balanced Can you include in the comment how the portion is represented in the int64_t, i.e. that it is sum of all range sizes? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.h@99 PS5, Line 99: void GetDistributionMap(std::map<TNetworkAddress, uint64_t>& distribution_map) const; Pass output parameter by pointer. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.h@106 PS5, Line 106: std::set<TNetworkAddress> nodes_; Should we do this in the backend config, possibly in a vector, and then just store indexes in this class? The scheduler's performance also suffers from the prolific use of strings and network addresses in various maps and could benefit from such a mapping in BackendConfig. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.h@111 PS5, Line 111: /// of the two underlying TNetworkAddress's. Collisions should be rare, so this will This looks like the birthday paradox and collisions might be more common that intuition suggests. Hashing 372 nodes 25 times each to 32bit would have a >1% probability of a collision. Please also see my comment in the implementation of AddNode on whether a collision of a node means that the subsequent hashes would also collide. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.h@113 PS5, Line 113: std::map<uint32_t, NodeIterator> node_hashmap_; It seems a bit confusing that a ordered map is called hash_map. Can you think of a different name, e.g. hash_to_node_? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.cc File be/src/scheduling/hash-ring.cc: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.cc@57 PS5, Line 57: node_hashmap_[hash_val] = node_it; If there's a collision, then another node hashed to the same hash_val, right? Does that mean that subsequent hashes returned by the hashgenerator would also be the same? Does one node effectively kick out the other altogether then? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.cc@83 PS5, Line 83: DCHECK_GT(node_hashmap_.size(), 0); !node_hashmap.empty() http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.cc@104 PS5, Line 104: TNetworkAddress addr = *(it->second); nit: consistent naming, here vs ip_addr below http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.cc@107 PS5, Line 107: if (distribution_map.find(addr) != distribution_map.end()) { : distribution_map[addr] += range; : } else { : distribution_map[addr] = range; : } I think the map default-initializes when using [], so you can just do map[addr] += range without the check. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/hash-ring.cc@115 PS5, Line 115: UINT_MAX If you use (1<<32) here, then the two parts of the bucket that wraps from UINT_MAX back to 0 should add up to the correct number of elements, no? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler-test-util.h File be/src/scheduling/scheduler-test-util.h: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler-test-util.h@271 PS5, Line 271: query_options_.num_simulated_remote_replicas = num; nit: move to .cc file if it doesn't fit into a single line, like SetReplicaPreference() above. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler-test-util.cc File be/src/scheduling/scheduler-test-util.cc: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler-test-util.cc@531 PS5, Line 531: return host.ip + ":" + std::to_string(host.be_port); I'm surprised this is necessary, are the IP addresses generated by HostIdxToIpAddr not unique? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler-test.cc File be/src/scheduling/scheduler-test.cc: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler-test.cc@220 PS5, Line 220: for (int i = 0; i < num_data_nodes; ++i) cluster.AddHost(false, true); There's also cluster.AddHosts which does the same as these two loops http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler-test.cc@229 PS5, Line 229: for (int32_t num_replicas = 1; num_replicas <= num_impala_nodes + 1; num_replicas++) { nit: use int and pre-increment to stay consistent within the file. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler-test.cc@248 PS5, Line 248: all blocks The file has 3 blocks, isn't it all scan ranges that are assigned to the same backend? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler-test.cc@293 PS5, Line 293: } you could assert that the number of backends is 1 less after the loop or that at least RemoveBackend was called once http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.h File be/src/scheduling/scheduler.h: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.h@189 PS5, Line 189: simulated_remote_replicas When I first read this it wasn't entirely clear to me from the comment what a simulated replica is. Would a name like "GetRemoteReplicaCandidates" describe the functionality, too? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.h@191 PS5, Line 191: /// the file name from 'hdfs_file_split' multiple times and look up the closest Have you considered passing the offset into the hash, too? Would that increase parallelism for large files? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.h@198 PS5, Line 198: int32_t num_remote_replicas, vector<IpAddr>& simulated_remote_replicas); We usually pass output parameters by pointer http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.cc@708 PS5, Line 708: num_simulated_remote_replicas, nit: line wrapping http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.cc@904 PS5, Line 904: const THdfsFileSplit* hdfs_file_split, int32_t num_remote_replicas, I think we usually just use "int" instead of int32_t, here and elsewhere. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.cc@907 PS5, Line 907: DCHECK_EQ(simulated_remote_replicas.size(), 0); You could call reserve(num_remote_replicas) on simulated_remote_replicas, though it probably won't matter much. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.cc@914 PS5, Line 914: set<TNetworkAddress> distinct_backends; Does the algorithm require the set to be ordered? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.cc@919 PS5, Line 919: uint32_t max_iterations = num_remote_replicas * 2; We usually don't use unsigned types per convention unless required by some interface. "int" should work here, too. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.cc@920 PS5, Line 920: for (uint32_t i = 0; i < max_iterations; i++) { I think we usually use pre-increment. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.cc@930 PS5, Line 930: for (auto it = distinct_backends.begin(); it != distinct_backends.end(); it++) { nit: use a range based loop? http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/scheduling/scheduler.cc@931 PS5, Line 931: const TBackendDescriptor *be_desc = executors_config_.LookUpBackendDesc(*it); nit: move space to right side of * http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/service/query-options.h File be/src/service/query-options.h: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/service/query-options.h@149 PS5, Line 149: TQueryOptionLevel::DEVELOPMENT)\ I would prefer "advanced" since this change sets a value of 3 which will change the existing behavior. Getting the current behavior back would then require to change a development option, which sounds more fragile and unstable than it is. http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/util/hash-util.h File be/src/util/hash-util.h: http://gerrit.cloudera.org:8080/#/c/12037/5/be/src/util/hash-util.h@282 PS5, Line 282: This is intended to be a slight improvement on that. Can you mention the benefits of doing that in the comment? http://gerrit.cloudera.org:8080/#/c/12037/5/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/12037/5/common/thrift/ImpalaService.thrift@355 PS5, Line 355: // The number of simulated replicas to use when scheduling remote HDFS ranges. The word "simulated" feels unintuitive to me, how about NUM_REMOTE_READ_CANDIDATES or NUM_REMOTE_SCANNER_CANDIDATES or something similar? http://gerrit.cloudera.org:8080/#/c/12037/5/common/thrift/ImpalaService.thrift@357 PS5, Line 357: will ... that can read ...? -- To view, visit http://gerrit.cloudera.org:8080/12037 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbf74088a8bd8c285ab7285ea3a01acd1bb53a45 Gerrit-Change-Number: 12037 Gerrit-PatchSet: 5 Gerrit-Owner: Joe McDonnell <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Philip Zeyliger <[email protected]> Gerrit-Comment-Date: Fri, 01 Feb 2019 18:55:20 +0000 Gerrit-HasComments: Yes
