Yida Wu has posted comments on this change. ( http://gerrit.cloudera.org:8080/18219 )
Change subject: IMPALA-11064 Optimizing Temporary File Structure for Batch Reading ...................................................................... Patch Set 10: (29 comments) http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@12 PS9, Line 12: With the patch, there will be two types of structures, one is the > nit with the patch Done http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@12 PS9, Line 12: s the > nit. allocating Done http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@52 PS9, Line 52: ew start option: > I wonder why not spell out the block size directly, as remote_batch_read_bl Changed to remote_batch_read_block_size_level. Keep the level to indicate it is not a direct size value. http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@63 PS9, Line 63: ew query options: > A slight better name would be remote_batch_read_prefetch_size. The idea for the perfetch here is to set the step instead of the size to be fetched. The step means how far from the current block to be prefetched, but only one block no matter how far. Size may indicate how many blocks to be prefetched, however, I think this can be done by changing the block size. http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@74 PS9, Line 74: > If I understand correctly, max number of temp files = 2^level times # of pa That is a good point. In fact remote_batch_buffer_file_limit limits the max number of temp files in local disk, however it doesn't work well during the test because it replies on the timeout mechanism to force upload when limitation reached and the files haven't been fully filled (that happens a lot), it is very slow. The idea of base_spill_id_level is to combine partitions from different partitioned nodes randomly, so that the half-filled blocks or files would be much less. Considered combining partitions from the same node before, but in the test, using the current way is more efficient. http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@103 PS9, Line 103: Only works for the files for batch reading. > May need to say a few sentences with performance numbers here to indicate t Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.h File be/src/exec/partitioned-hash-join-builder.h: http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.h@82 PS9, Line 82: int64_t level, uint64_t default_id > I wonder if fragment instance should play a role here. In the mt_dop > 0 ca I think in that case, multiple instances but with the same partition id could share the same block id (spill id). The question is if the space (block number) is limited, what is merge policy? The common way is to merge certain partitions of the same instance to the same block. I tried that at first, and found that to merge the same partition index but different instances is more efficient while doing the read. Writes some of my thinking on this in the next comment. But maybe more tests are needed to confirm. http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.h@84 PS9, Line 84: 16; > This limits max # of partitions to be 2^16 - 1, which could be very limitin If my understanding is correct, the partition number here relies on the PARTITION_FANOUT of the PHJ node, which is 16, should be different from the table partitions. The current idea is based on the spilling of different partitions can be concurrent, however the reading is always one by one within the same node. So it separates the partition id off the base spill id to make different partition data of the same node stored in different blocks instead of mixing them together. Mod the max temporary file number should be similar to setting the base spill id. But the base spill id is configurable. I think the challenge is very difficult to find out the optimal number, max temporary file number could be a candidate, but because batch reading files shares the space with the ordinary temp files which are not partitioned hash join node, so it depends on the global workload. In general, we want to have more batch reading files to read faster while not too many that may block and slow down the writes. While base spill id (spill id level) is configurable, we could do more tests to see what is pattern of the optimal number for different workloads. http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.h@86 PS9, Line 86: DCHECK_GE(level, 0); > nit. this probably is nor necessary because of line 83. Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.cc File be/src/exec/partitioned-hash-join-builder.cc: http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.cc@194 PS9, Line 194: // status. : if (probe_barrier_ != nullptr) state->AddBarrierToCancel(probe_b > nit. This line of code probably should be part of the member initialization Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.cc@225 PS9, Line 225: filter_ctxs_.back().filter = state->filter_bank()->RegisterProducer(filter_desc); : } > nit. same comment as above. Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-node.cc File be/src/exec/partitioned-hash-join-node.cc: http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-node.cc@123 PS9, Line 123: process_probe_batch_fn_level0_(pnode.process_probe_batch_fn_level0_) { : memset(hash_tbls_, 0, sizeof(HashTable*) * PARTITION_FANOUT); > nit. Move to member initialization section. Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/io/local-file-writer.cc File be/src/runtime/io/local-file-writer.cc: http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/io/local-file-writer.cc@66 PS9, Line 66: WriteOne > nit. Probably renamed as WriteOneRange(). The name has been there for a while, and the rename requires changes in quit a few places. May change it later. http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/io/local-file-writer.cc@77 PS9, Line 77: if (ret_status.ok()) { > nit. This line can be moved inside the next block to avoid check ret_status Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h File be/src/runtime/tmp-file-mgr-internal.h: http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@81 PS9, Line 81: set_is_f > nt. set_is_ful(). Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@109 PS9, Line 109: : /// blocks with > nit. "among all blocks with the same spill id". Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@124 PS9, Line 124: offset_ > nit. offset_within_block_ Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@129 PS9, Line 129: /// The id of the block in a temporary file, and it is 0-based. > nit. 0-based Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@427 PS9, Line 427: int > nit uint_32 Uses int because it leaves a room for negative value to indicate special meanings. Added a check for greater or equal to 0 for block id. http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@429 PS9, Line 429: DCHECK_LT(block_id, file_mgr_->GetNumReadBuffersPerFile()) > nit. may augment with an error message. Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@438 PS9, Line 438: eturn true only > nit. May rename as IncreaseFullBlockCounter() Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@450 PS9, Line 450: TrySetFullBlocks() { > nit LIKELY() Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@485 PS9, Line 485: > nit. Do we return false also when the allocation fails? May need to state t Assert the space allocation from a new block successful, because we expect a page can't be lager than a block. Added comments. http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@485 PS9, Line 485: > nit remove Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@591 PS9, Line 591: e read buf > nt. Do we need to DCHECK_LE() first? Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@867 PS9, Line 867: > nit being destroyed. Done http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@890 PS9, Line 890: gro > nit. may renamed as resource_bytes changed to resource_val, because it may not be bytes. http://gerrit.cloudera.org:8080/#/c/18219/9/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/18219/9/common/thrift/ImpalaService.thrift@744 PS9, Line 744: REMOTE_BATCH_READ_PREFETCH_STEP = 148 : : // Maximum number of file > Need to provide a specific comment for each new option. Done http://gerrit.cloudera.org:8080/#/c/18219/9/common/thrift/Query.thrift File common/thrift/Query.thrift: http://gerrit.cloudera.org:8080/#/c/18219/9/common/thrift/Query.thrift@601 PS9, Line 601: // See comment in ImpalaService.thrift : 149: optional i32 remote_batch_buffer_file_limit = 0; : 150: optional i32 remote_batch_read_prefetch_step = 1; : 151: optional i32 base_spill_id_level = 6; > better to follow the existing convention above. Done -- To view, visit http://gerrit.cloudera.org:8080/18219 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If913785cac9e2dafa20013b6600c87fcaf3e2018 Gerrit-Change-Number: 18219 Gerrit-PatchSet: 10 Gerrit-Owner: Yida Wu <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Fri, 18 Nov 2022 00:04:58 +0000 Gerrit-HasComments: Yes
