Yida Wu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17979 )

Change subject: IMPALA-10791 Add batching reading for remote temporary files
......................................................................


Patch Set 6:

(18 comments)

Thanks Qifan for the detailed review.
Changed the name to "read buffer block", previously was "read buffer" or "mem 
block", and MemBlock is the implementation of the "read buffer block".

http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h
File be/src/runtime/io/disk-file.h:

http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@52
PS5, Line 52:
> Need a description for MemBlock class.
Done


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@62
PS5, Line 62:
> nit. Return whether the memory is reserved or allocated before deletion to
Done


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@68
PS5, Line 68:
> nit. may switch the order of the two arguments as the lock appears first in
Done


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@168
PS5, Line 168:   /// The status of the memory block.
> nit. May describe the content of a memory block a little bit. For example,
Done


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@200
PS5, Line 200:   DiskFile(const std::string& path, DiskIoMgr* io_mgr, int64_t 
file_size,
> nit missing the comment.
Done


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@206
PS5, Line 206:
> Since this class is nested within DiskFile class, suggest to call it ReadBu
Done


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@206
PS5, Line 206:
             :   virtual ~DiskFile() {}
> A little bit confusing on how these two names relate to mem blocks. See my
Previously, a read buffer and a memory block are the same thing in our patch. 
Changed to name it as "read buffer block", because compared to "memory block", 
the TmpFileMgr may be easier to know what the block is used for by its name. 
And the block may not be necessarily in memory, at first we think to have it in 
disk, the name could be easier to extend if later we have some different 
implementation.
Added some comments, hope to make it clearer.


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@210
PS5, Line 210:  page, each time
> Suggest to use a name relate to memory block, say mem_block_size_?
Changed to read_buffer_block_size_.


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@213
PS5, Line 213: ilable or not.
> suggest to rename to num_mem_blocks_.
Changed to num_of_read_buffer_blocks_.


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@326
PS5, Line 326: id UpdateReadBufferMetaData(int64_t offset) {
> I saw the IF check below which should take care of index being 12. Since th
It depends on the file size.
For example, if the file size is 128MB, block size is 8MB. Then there will be 
128/8 = 16 mem blocks per file.
Therefore, there is no limitation that to be only 4 blocks per file, the number 
of blocks is decided by default file size and default block size. In this case, 
the last index is 15 for offset over 120MB.

There could be a case that the actual file size is over the default one a 
little bit, but it is assumed no more than a page size. For example, it could 
be 129MB for the file, therefore, the actual size of the last block can be 9MB, 
but it is not a huge difference compared to other blocks.

The number of read blocks is calculated here:
https://gerrit.cloudera.org/#/c/17979/6/be/src/runtime/tmp-file-mgr.cc@411


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.cc
File be/src/runtime/io/disk-file.cc:

http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.cc@116
PS5, Line 116:
> Since the memory is actually deleted in this method, maybe we should call t
Think for a while, maybe DISABLED is better, because it means the block doesn't 
allow to use anymore. DELETED may mean sth has been deleted, but in our case, 
it doesn't necessarily delete anything for being the state.


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@396
PS5, Line 396: // Fetch the data from the source file (remote) to the 
destination file (local).
             :   DCHECK(disk_file_dst_ != nullptr);
> May need to add a comment to describe these two files. An example would be
Done


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@424
PS5, Line 424:           read_buffer_bloc, MemBlockStatus::DISABLED, dstl, 
&read_buffer_lock)) {
> Why can't we use this DISABLED memory block here?  Looks like we should not
Because the memory of the block is released, if this case happens, we can't use 
the memory block unless we can reserve the memory again. But we don't allow the 
block status to go back from DISABLED to RESERVED, and the case when DISABLED 
happens, it could be that the query is finished or cancelled, or all the pages 
of the block are read, all cases should have no need to fetch anything anymore.
Added some comments.


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@436
PS5, Line 436: atus = Status(TErrorCode::DISK_IO_ERROR, GetBackendStri
> This is not used.
The read timer works when it is destructed. It counts the time interval during 
its life cycle.


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@445
PS5, Line 445: queue->read_size()->Update(local_file_size);
             :       disk_file_dst_->SetReadBufferBlockStatus(
> Should we set the memory block allocated above to DISABLED here so that it
Do you mean delete the memory block if fails? Right now it is handled by the 
callback function in TmpFile, because the deletion of a block needs to add some 
number to the TmpFileMgr metrics according to different block status, and the 
DiskIoMgr isn't aware of TmpFileMgr's structure.
So I guess the first question is the same that the callback function would set 
the block to DISABLED if fails.

The failure handle is here:
https://gerrit.cloudera.org/#/c/17979/6/be/src/runtime/tmp-file-mgr.cc@1040
Added some comments.


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/tmp-file-mgr.h
File be/src/runtime/tmp-file-mgr.h:

http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/tmp-file-mgr.h@147
PS5, Line 147: read_buffer_block_size_;
> nit. This name sounds like the buffer is remote. Maybe renamed as mem_block
Changed to read_buffer_block_size_.


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/tmp-file-mgr.h@151
PS5, Line 151: num_read_buffer_blocks_per_file_;
> similar comment as above. num_mem_blocks_per_file instead?
Changed to num_read_buffer_blocks_per_file_.


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/tmp-file-mgr.h@158
PS5, Line 158: max_read_buffer_size_;
> max_mem_block_size_?
Changed to max_read_buffer_size_.



--
To view, visit http://gerrit.cloudera.org:8080/17979
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1dcc5d0881ffaeff09c5c514306cd668373ad31b
Gerrit-Change-Number: 17979
Gerrit-PatchSet: 6
Gerrit-Owner: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Comment-Date: Tue, 09 Nov 2021 17:44:02 +0000
Gerrit-HasComments: Yes

Reply via email to