Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18219 )

Change subject: IMPALA-11064 Optimizing Temporary File Structure for Batch 
Reading
......................................................................


Patch Set 9:

(29 comments)

Looks good!

It seems there is still room to simplify the new query options. That is the 
interface to the new temp file scheme.

That to be able to handle many many partitions is also a key.

http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@12
PS9, Line 12: allocate
nit. allocating


http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@12
PS9, Line 12: There will be two types of structures, one is the original, 
allocate
nit with the patch


http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@52
PS9, Line 52: remote_batch_read_max_block_size_level
I wonder why not spell out the block size directly, as 
remote_batch_read_block_size. Simplicity is better.


http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@63
PS9, Line 63: remote_batch_read_prefetch_step
A slight better name would be remote_batch_read_prefetch_size.

The option specifies the number of subsequent blocks to prefetch.


http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@74
PS9, Line 74: base_spill_id_level
If I understand correctly, max number of temp files = 2^level times # of 
partitions. When there are many partitions, this number could still be large.

Would it be possible to provide an option on max number of temp files directly?


http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@103
PS9, Line 103:
May need to say a few sentences with performance numbers here to indicate the 
new scheme is better.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.h@82
PS9, Line 82: int64_t level, uint64_t default_id
I wonder if fragment instance should play a role here. In the mt_dop > 0 case, 
multiple instances can be running on a single node. In this case, it may be 
useful to place spills from fragment instances on different temp files.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.h@84
PS9, Line 84: 16;
This limits max # of partitions to be 2^16 - 1, which could be very limiting 
for tables with range partitions on weeks or days.

Would it be possible to use the following existing code directly?  The inputs 
are a sequence of uints: default_id, partition_id, and possibly fragment 
instance id. And then mod on the max number of temp files (see my comment in 
commit message).

                                                                                
                                                                             
 91   /// CrcHash() specialized for 8-byte data     
 92   static inline uint32_t CrcHash8(const void* v, uint32_t hash) {
 93     DCHECK(CpuInfo::IsSupported(CpuInfo::SSE4_2) || base::IsAarch64());     
     
 94     const uint64_t* p = reinterpret_cast<const uint64_t*>(v);
 95     hash = SSE4_crc32_u64(hash, *p);
 96     hash = (hash << 16) | (hash >> 16);                                     
           
 97     return hash;                                                  
 98   }
util/hash-util.h


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.h@86
PS9, Line 86:  DCHECK_GE(level, 0);
nit. this probably is nor necessary because of line 83.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.cc@194
PS9, Line 194: base_spill_id_ = PhjBuilderConfig::GenerateBaseSpillId(
             :       state->query_options().base_spill_id_level, 
(uint64_t)this);
nit. This line of code probably should be part of the member initialization 
section at line 180.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.cc@225
PS9, Line 225:  base_spill_id_ = PhjBuilderConfig::GenerateBaseSpillId(
             :       state->query_options().base_spill_id_level, 
(uint64_t)this);
nit. same comment as above.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-node.cc@123
PS9, Line 123: base_spill_id_ = PhjBuilderConfig::GenerateBaseSpillId(
             :       state->query_options().base_spill_id_level, 
(uint64_t)this);
nit. Move to member initialization section.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/io/local-file-writer.cc
File be/src/runtime/io/local-file-writer.cc:

http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/io/local-file-writer.cc@66
PS9, Line 66: WriteOne
nit. Probably renamed as WriteOneRange().


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/io/local-file-writer.cc@77
PS9, Line 77: if (ret_status.ok() && !close_status.ok()) ret_status = 
close_status;
nit. This line can be moved inside the next block to avoid check 
ret_status.ok() twice.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h
File be/src/runtime/tmp-file-mgr-internal.h:

http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@81
PS9, Line 81: set_full
nt. set_is_ful().


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@109
PS9, Line 109: with
             :   /// the specific
nit. "among all blocks with the same spill id".


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@124
PS9, Line 124: offset_
nit. offset_within_block_


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@129
PS9, Line 129:   /// The id of the block in a temporary file. For example, if 
the block is the first one
nit. 0-based


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@427
PS9, Line 427: int
nit uint_32


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@429
PS9, Line 429: DCHECK_LT(block_id, file_mgr_->GetNumReadBuffersPerFile());
nit. may augment with an error message.

DCHECK_LT(block_id, file_mgr_->GetNumReadBuffersPerFile()) << "block_id is out 
of range.";


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@438
PS9, Line 438: IncFullBlockStat
nit. May rename as IncreaseFullBlockCounter()


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@450
PS9, Line 450: (full_block_cnt_ < file_mgr_->GetNumReadBuffersPerFile()
nit LIKELY()


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@485
PS9, Line 485: should be allocated
nit. Do we return false also when the allocation fails? May need to state that 
fact in the comment.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@485
PS9, Line 485: would
nit remove


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@591
PS9, Line 591: buffer_idx
nt. Do we need to DCHECK_LE() first?


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@867
PS9, Line 867: destroying
nit being destroyed.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@890
PS9, Line 890: val
nit. may renamed as resource_bytes


http://gerrit.cloudera.org:8080/#/c/18219/9/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/18219/9/common/thrift/ImpalaService.thrift@744
PS9, Line 744:   REMOTE_BATCH_READ_PREFETCH_STEP = 148
             :   REMOTE_BATCH_BUFFER_FILE_LIMIT = 149
             :   BASE_SPILL_ID_LEVEL = 150
Need to provide a specific comment for each new option.


http://gerrit.cloudera.org:8080/#/c/18219/9/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/18219/9/common/thrift/Query.thrift@601
PS9, Line 601:   // Control the batch reading of temporary files.
             :   149: optional i32 remote_batch_buffer_file_limit = 0;
             :   150: optional i32 remote_batch_read_prefetch_step = 1;
             :   151: optional i32 base_spill_id_level = 6;
better to follow the existing convention above.



--
To view, visit http://gerrit.cloudera.org:8080/18219
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If913785cac9e2dafa20013b6600c87fcaf3e2018
Gerrit-Change-Number: 18219
Gerrit-PatchSet: 9
Gerrit-Owner: Yida Wu <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Fri, 21 Oct 2022 15:35:03 +0000
Gerrit-HasComments: Yes

Reply via email to