Michael Smith has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18219 )

Change subject: IMPALA-11064 Optimizing Temporary File Structure for Batch 
Reading
......................................................................


Patch Set 5:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/exec/partitioned-hash-join-builder.h@80
PS5, Line 80:   /// left shift 16 bits for the later specified partition id.
PartitionId is an int, which can be 32-bits. Should we left shift 32 bits here?


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/exec/partitioned-hash-join-builder.h@82
PS5, Line 82:   static uint64_t GenerateBaseSpillId(int64_t level, uint64_t 
default_id) {
Can you add a small unit test for this?


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/buffered-tuple-stream.h
File be/src/runtime/buffered-tuple-stream.h:

http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/buffered-tuple-stream.h@422
PS5, Line 422:   void ResetSpillId() { spill_id_ = 0; }
This seems to be unused. When would it be needed?


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr-internal.h
File be/src/runtime/tmp-file-mgr-internal.h:

http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr-internal.h@847
PS5, Line 847:   /// Shut down the pool before destruction.
Is it a loop or a pool? Or both? Mixing them makes it seem like there's a typo.


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.h
File be/src/runtime/tmp-file-mgr.h:

http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.h@617
PS5, Line 617:   uint64_t EnqueueTmpBlockBySpillId(uint64_t spill_id, TmpBlock* 
tmp_block) {
This doesn't require any locking?


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@130
PS5, Line 130:     "The size should be power of 2 and integer times of the 
block size. ");
Maybe "The size should be a power of 2, and an integer multiple of the block 
size."

Does it really need to be a power of 2? Or does it just need to be an integer 
multiple of the block size? Would remote_batch_read_max_block_size_level=3, 
remote_batch_read_tmp_file_size=24M be fine?

Also mixing level and MB entries like this seems confusing and unnecessary.


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@710
PS5, Line 710:   // Chcek the unuploaded_files_accessing_ vector to see if the 
access time of any file
nit: Chcek -> Check


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@712
PS5, Line 712:   while (true) {
Why isn't this

    while(it != unuploaded_files_accessing_.end()) {


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@721
PS5, Line 721:     if (!tmp_file->TrySetFullBlocks()) {
This could all be

    if (tmp_file->TrySetFullBlocks()) tmp_file->SetFullBatchReadFileHelper(true 
/*is_force*/);
    it = unuploaded_files_accessing_.erase(it);


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@746
PS5, Line 746:     if (!tmp_file->TrySetFullBlocks()) {
If you change the previous loop, I'd also invert this conditional.


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@764
PS5, Line 764:   // remote_batch_read_max_block_size_level. For example, it the 
max level is 3,
nit: it -> if


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@798
PS5, Line 798:       >> 1;
Is the bit manipulation really worth it over multiplication/division? Seems 
less clear.


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@1343
PS5, Line 1343:   // Caculate the actual content bytes and file size for the 
batch reading file.
nit: Caculate -> Calculate


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@1639
PS5, Line 1639:     // Prefetch a block with spill id minus one if it is the 
end of the current spill id.
Why is it minus one instead of plus one?


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/runtime/tmp-file-mgr.cc@1933
PS5, Line 1933:     //    
tmp_file_mgr_->tmp_dirs_remote_ctrl_.tmp_file_pool_->IsAnyWaiting();
Why is this code here but commented out?


http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/18219/5/be/src/service/query-options.cc@1259
PS5, Line 1259:         if (result != StringParser::PARSE_SUCCESS || level < -1 
|| level > 12) {
Why 12 instead of 16 (or 32)?


http://gerrit.cloudera.org:8080/#/c/18219/5/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/18219/5/common/thrift/Query.thrift@598
PS5, Line 598:   148: optional i32 remote_batch_read_prefetch_step= 1;
nit: other lines in this file put a space before '='

This is limited to 10, could we use a smaller type like byte for it?


http://gerrit.cloudera.org:8080/#/c/18219/5/common/thrift/Query.thrift@600
PS5, Line 600:   150: optional i32 base_spill_id_level= 6;
This is limited to 12, could we use a smaller type (like byte) for it?



--
To view, visit http://gerrit.cloudera.org:8080/18219
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If913785cac9e2dafa20013b6600c87fcaf3e2018
Gerrit-Change-Number: 18219
Gerrit-PatchSet: 5
Gerrit-Owner: Yida Wu <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Mon, 20 Jun 2022 23:42:59 +0000
Gerrit-HasComments: Yes

Reply via email to