Yida Wu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18219 )

Change subject: IMPALA-11064 Optimizing Temporary File Structure for Batch 
Reading
......................................................................


Patch Set 10:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@12
PS9, Line 12: With the patch, there will be two types of structures, one is the
> nit with the patch
Done


http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@12
PS9, Line 12: s the
> nit. allocating
Done


http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@52
PS9, Line 52: ew start option:
> I wonder why not spell out the block size directly, as remote_batch_read_bl
Changed to remote_batch_read_block_size_level. Keep the level to indicate it is 
not a direct size value.


http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@63
PS9, Line 63: ew query options:
> A slight better name would be remote_batch_read_prefetch_size.
The idea for the perfetch here is to set the step instead of the size to be 
fetched.
The step means how far from the current block to be prefetched, but only one 
block no matter how far.
Size may indicate how many blocks to be prefetched, however, I think this can 
be done by changing the block size.


http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@74
PS9, Line 74:
> If I understand correctly, max number of temp files = 2^level times # of pa
That is a good point. In fact remote_batch_buffer_file_limit limits the max 
number of temp files in local disk, however it doesn't work well during the 
test because it replies on the timeout mechanism to force upload when 
limitation reached and the files haven't been fully filled (that happens a 
lot), it is very slow. The idea of base_spill_id_level is to combine partitions 
from different partitioned nodes randomly, so that the half-filled blocks or 
files would be much less. Considered combining partitions from the same node 
before, but in the test, using the current way is more efficient.


http://gerrit.cloudera.org:8080/#/c/18219/9//COMMIT_MSG@103
PS9, Line 103: Only works for the files for batch reading.
> May need to say a few sentences with performance numbers here to indicate t
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.h@82
PS9, Line 82: int64_t level, uint64_t default_id
> I wonder if fragment instance should play a role here. In the mt_dop > 0 ca
I think in that case, multiple instances but with the same partition id could 
share the same block id (spill id). The question is if the space (block number) 
is limited, what is merge policy? The common way is to merge certain partitions 
of the same instance to the same block. I tried that at first, and found that 
to merge the same partition index but different instances is more efficient 
while doing the read. Writes some of my thinking on this in the next comment. 
But maybe more tests are needed to confirm.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.h@84
PS9, Line 84: 16;
> This limits max # of partitions to be 2^16 - 1, which could be very limitin
If my understanding is correct, the partition number here relies on the 
PARTITION_FANOUT of the PHJ node, which is 16, should be different from the 
table partitions.

The current idea is based on the spilling of different partitions can be 
concurrent, however the reading is always one by one within the same node. So 
it separates the partition id off the base spill id to make different partition 
data of the same node stored in different blocks instead of mixing them 
together.

Mod the max temporary file number should be similar to setting the base spill 
id. But the base spill id is configurable.

I think the challenge is very difficult to find out the optimal number, max 
temporary file number could be a candidate, but because batch reading files 
shares the space with the ordinary temp files which are not partitioned hash 
join node, so it depends on the global workload. In general, we want to have 
more batch reading files to read faster while not too many that may block and 
slow down the writes.

While base spill id (spill id level) is configurable, we could do more tests to 
see what is pattern of the optimal number for different workloads.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.h@86
PS9, Line 86:  DCHECK_GE(level, 0);
> nit. this probably is nor necessary because of line 83.
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.cc@194
PS9, Line 194: // status.
             :   if (probe_barrier_ != nullptr) 
state->AddBarrierToCancel(probe_b
> nit. This line of code probably should be part of the member initialization
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-builder.cc@225
PS9, Line 225:    filter_ctxs_.back().filter = 
state->filter_bank()->RegisterProducer(filter_desc);
             :   }
> nit. same comment as above.
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/exec/partitioned-hash-join-node.cc@123
PS9, Line 123:   
process_probe_batch_fn_level0_(pnode.process_probe_batch_fn_level0_) {
             :   memset(hash_tbls_, 0, sizeof(HashTable*) * PARTITION_FANOUT);
> nit. Move to member initialization section.
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/io/local-file-writer.cc
File be/src/runtime/io/local-file-writer.cc:

http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/io/local-file-writer.cc@66
PS9, Line 66: WriteOne
> nit. Probably renamed as WriteOneRange().
The name has been there for a while, and the rename requires changes in quit a 
few places. May change it later.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/io/local-file-writer.cc@77
PS9, Line 77: if (ret_status.ok()) {
> nit. This line can be moved inside the next block to avoid check ret_status
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h
File be/src/runtime/tmp-file-mgr-internal.h:

http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@81
PS9, Line 81: set_is_f
> nt. set_is_ful().
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@109
PS9, Line 109:
             :   /// blocks with
> nit. "among all blocks with the same spill id".
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@124
PS9, Line 124: offset_
> nit. offset_within_block_
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@129
PS9, Line 129:   /// The id of the block in a temporary file, and it is 0-based.
> nit. 0-based
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@427
PS9, Line 427: int
> nit uint_32
Uses int because it leaves a room for negative value to indicate special 
meanings. Added a check for greater or equal to 0 for block id.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@429
PS9, Line 429: DCHECK_LT(block_id, file_mgr_->GetNumReadBuffersPerFile())
> nit. may augment with an error message.
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@438
PS9, Line 438: eturn true only
> nit. May rename as IncreaseFullBlockCounter()
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@450
PS9, Line 450: TrySetFullBlocks() {
> nit LIKELY()
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@485
PS9, Line 485:
> nit. Do we return false also when the allocation fails? May need to state t
Assert the space allocation from a new block successful, because we expect a 
page can't be lager than a block. Added comments.


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@485
PS9, Line 485:
> nit remove
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@591
PS9, Line 591: e read buf
> nt. Do we need to DCHECK_LE() first?
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@867
PS9, Line 867:
> nit being destroyed.
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/be/src/runtime/tmp-file-mgr-internal.h@890
PS9, Line 890: gro
> nit. may renamed as resource_bytes
changed to resource_val, because it may not be bytes.


http://gerrit.cloudera.org:8080/#/c/18219/9/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/18219/9/common/thrift/ImpalaService.thrift@744
PS9, Line 744:   REMOTE_BATCH_READ_PREFETCH_STEP = 148
             :
             :   // Maximum number of file
> Need to provide a specific comment for each new option.
Done


http://gerrit.cloudera.org:8080/#/c/18219/9/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/18219/9/common/thrift/Query.thrift@601
PS9, Line 601:   // See comment in ImpalaService.thrift
             :   149: optional i32 remote_batch_buffer_file_limit = 0;
             :   150: optional i32 remote_batch_read_prefetch_step = 1;
             :   151: optional i32 base_spill_id_level = 6;
> better to follow the existing convention above.
Done



--
To view, visit http://gerrit.cloudera.org:8080/18219
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If913785cac9e2dafa20013b6600c87fcaf3e2018
Gerrit-Change-Number: 18219
Gerrit-PatchSet: 10
Gerrit-Owner: Yida Wu <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Fri, 18 Nov 2022 00:04:58 +0000
Gerrit-HasComments: Yes

Reply via email to