Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/18219 )
Change subject: IMPALA-11064 Optimizing Temporary File Structure for Batch Reading ...................................................................... Patch Set 5: (18 comments) http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/exec/partitioned-hash-join-builder.h File be/src/exec/partitioned-hash-join-builder.h: http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/exec/partitioned-hash-join-builder.h@80 PS5, Line 80: /// left shift 16 bits for the later specified partition id. PartitionId is an int, which can be 32-bits. Should we left shift 32 bits here? http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/exec/partitioned-hash-join-builder.h@82 PS5, Line 82: static uint64_t GenerateBaseSpillId(int64_t level, uint64_t default_id) { Can you add a small unit test for this? http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/buffered-tuple-stream.h File be/src/runtime/buffered-tuple-stream.h: http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/buffered-tuple-stream.h@422 PS5, Line 422: void ResetSpillId() { spill_id_ = 0; } This seems to be unused. When would it be needed? http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr-internal.h File be/src/runtime/tmp-file-mgr-internal.h: http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr-internal.h@847 PS5, Line 847: /// Shut down the pool before destruction. Is it a loop or a pool? Or both? Mixing them makes it seem like there's a typo. http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.h File be/src/runtime/tmp-file-mgr.h: http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.h@617 PS5, Line 617: uint64_t EnqueueTmpBlockBySpillId(uint64_t spill_id, TmpBlock* tmp_block) { This doesn't require any locking? http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@130 PS5, Line 130: "The size should be power of 2 and integer times of the block size. "); Maybe "The size should be a power of 2, and an integer multiple of the block size." Does it really need to be a power of 2? Or does it just need to be an integer multiple of the block size? Would remote_batch_read_max_block_size_level=3, remote_batch_read_tmp_file_size=24M be fine? Also mixing level and MB entries like this seems confusing and unnecessary. http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@710 PS5, Line 710: // Chcek the unuploaded_files_accessing_ vector to see if the access time of any file nit: Chcek -> Check http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@712 PS5, Line 712: while (true) { Why isn't this while(it != unuploaded_files_accessing_.end()) { http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@721 PS5, Line 721: if (!tmp_file->TrySetFullBlocks()) { This could all be if (tmp_file->TrySetFullBlocks()) tmp_file->SetFullBatchReadFileHelper(true /*is_force*/); it = unuploaded_files_accessing_.erase(it); http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@746 PS5, Line 746: if (!tmp_file->TrySetFullBlocks()) { If you change the previous loop, I'd also invert this conditional. http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@764 PS5, Line 764: // remote_batch_read_max_block_size_level. For example, it the max level is 3, nit: it -> if http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@798 PS5, Line 798: >> 1; Is the bit manipulation really worth it over multiplication/division? Seems less clear. http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@1343 PS5, Line 1343: // Caculate the actual content bytes and file size for the batch reading file. nit: Caculate -> Calculate http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@1639 PS5, Line 1639: // Prefetch a block with spill id minus one if it is the end of the current spill id. Why is it minus one instead of plus one? http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@1933 PS5, Line 1933: // tmp_file_mgr_->tmp_dirs_remote_ctrl_.tmp_file_pool_->IsAnyWaiting(); Why is this code here but commented out? http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/service/query-options.cc File be/src/service/query-options.cc: http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/service/query-options.cc@1259 PS5, Line 1259: if (result != StringParser::PARSE_SUCCESS || level < -1 || level > 12) { Why 12 instead of 16 (or 32)? http://gerrit.cloudera.org:8080/#/c/18219/5/common/thrift/Query.thrift File common/thrift/Query.thrift: http://gerrit.cloudera.org:8080/#/c/18219/5/common/thrift/Query.thrift@598 PS5, Line 598: 148: optional i32 remote_batch_read_prefetch_step= 1; nit: other lines in this file put a space before '=' This is limited to 10, could we use a smaller type like byte for it? http://gerrit.cloudera.org:8080/#/c/18219/5/common/thrift/Query.thrift@600 PS5, Line 600: 150: optional i32 base_spill_id_level= 6; This is limited to 12, could we use a smaller type (like byte) for it? -- To view, visit http://gerrit.cloudera.org:8080/18219 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If913785cac9e2dafa20013b6600c87fcaf3e2018 Gerrit-Change-Number: 18219 Gerrit-PatchSet: 5 Gerrit-Owner: Yida Wu <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Mon, 20 Jun 2022 23:42:59 +0000 Gerrit-HasComments: Yes
