Michael Smith has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18219 )

Change subject: IMPALA-11064 Optimizing Temporary File Structure for Batch 
Reading
......................................................................


Patch Set 8:

(18 comments)

Did another pass through. Looks pretty good. I had some more minor comments.

http://gerrit.cloudera.org:8080/#/c/18219/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18219/8//COMMIT_MSG@67
PS8, Line 67: a block with spill id 1 and sequence number 1.
It's not really clear to me what would happen if step number is other than 1, 
or why you'd want to try something different. Can you add other examples?


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/io/disk-file.h
File be/src/runtime/io/disk-file.h:

http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/io/disk-file.h@256
PS8, Line 256:     return file_writer_->written_bytes();
Any chance this should also be under lock_guard? Looks like probably not 
because it's only used in TmpFileUploadThreadLoop.


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/io/disk-io-mgr.cc@291
PS8, Line 291:     DCHECK_LE(written_bytes, disk_file_->actual_content_bytes());
Why call actual_content_bytes() again (which uses a lock) rather than compare 
to actual_file_size?


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/io/disk-io-mgr.cc@413
PS8, Line 413: Status RemoteOperRange::DoFetch(RequestContext* io_ctx) {
Why is io_ctx added? It doesn't seem to be used at all.


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/io/disk-io-mgr.cc@435
PS8, Line 435:     // Return okay because the the remote file is deleted 
probably when the query is
nit: "the the", and I'd remove "probably" unless there's some sort of race 
condition here that we're covering (in which case I'd be clearer about it).


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/io/request-context.h
File be/src/runtime/io/request-context.h:

http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/io/request-context.h@195
PS8, Line 195:   void set_read_use_local_disk_batch_counter(
nit: it'd be nice if order of these functions added order of the counter 
variables later.


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr-internal.h
File be/src/runtime/tmp-file-mgr-internal.h:

http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr-internal.h@67
PS8, Line 67:       is_full_ = true;
Why does this set is_full_ when it doesn't actually allocate the space? If a 
smaller num_bytes request came in, you could presumably still add space?

I guess this is because we want blocks to be filled sequentially by spill id 
sequence?


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr-test.cc
File be/src/runtime/tmp-file-mgr-test.cc:

http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr-test.cc@334
PS8, Line 334:       TmpFileGroup& file_group, int idx, bool batch_read = 
false) {
Planning to add tests where batch_read is false?


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr-test.cc@2148
PS8, Line 2148: TEST_F(TmpFileMgrTest, TestBatchReadFromRemote) {
I wonder if this whole test should be duplicated, so we can have both 
batch_read=true and batch_read=false.


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr.h
File be/src/runtime/tmp-file-mgr.h:

http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr.h@521
PS8, Line 521:       int remote_batch_read_prefetch_step = 4, int 
remote_batch_buffer_file_limit = 0);
These default values are only used in testing, and 
remote_batch_read_prefetch_step doesn't match the startup default. Seems odd to 
set them.


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr.cc@457
PS8, Line 457:   EnqueueAllDummyFiles();
I'm not really sure why this is needed. Reading the function comments doesn't 
help. An overview on how dummy files are used in TmpFileMgr would help.


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr.cc@545
PS8, Line 545:     const unique_lock<shared_mutex>& file_lock, TmpFile* 
tmp_file) {
This function is only called one place, where we know tmp_file is a 
TempFileRemote.


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr.cc@549
PS8, Line 549:   if (!tmp_file_remote->UseBatchReading()) return;
This check redundant because we called it in TempFileRemote right before the 
only place this function is called.


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr.cc@608
PS8, Line 608:   if (cur_bytes + file_size <= dir->bytes_limit_) {
I'm not clear what this pattern is doing. I see it here and somewhat similar in 
EnqueueAllDummyFiles above.


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr.cc@831
PS8, Line 831:   // Limit the number of files for batch reading to half of the 
local capacity by default.
nit: "by default" seems misleading here. I don't see any way to change this 
calculation.


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/runtime/tmp-file-mgr.cc@1362
PS8, Line 1362:   if (GetWriteFile()->actual_content_bytes() > 0) {
Personally I'd rather call actual_content_bytes once and store it locally for 
comparison and assignment to avoid the spinlock. But it's probably not a big 
deal.

On my wish list: C++17 for initializer statements in if statements.


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/service/query-options.h@279
PS8, Line 279:   QUERY_OPT_FN(remote_batch_read_prefetch_step, 
REMOTE_BATCH_READ_PREFETCH_STEP,         \
When will docs for these new query options be added?


http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/18219/8/be/src/service/query-options.cc@1255
PS8, Line 1255:                          "Only integer value larger than or 
equals to 0 is allowed.",
nit: "equal" instead of "equals"



--
To view, visit http://gerrit.cloudera.org:8080/18219
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If913785cac9e2dafa20013b6600c87fcaf3e2018
Gerrit-Change-Number: 18219
Gerrit-PatchSet: 8
Gerrit-Owner: Yida Wu <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Thu, 06 Oct 2022 23:51:25 +0000
Gerrit-HasComments: Yes

Reply via email to