Yida Wu has posted comments on this change. ( http://gerrit.cloudera.org:8080/17979 )
Change subject: IMPALA-10791 Add batching reading for remote temporary files ...................................................................... Patch Set 6: (18 comments) Thanks Qifan for the detailed review. Changed the name to "read buffer block", previously was "read buffer" or "mem block", and MemBlock is the implementation of the "read buffer block". http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h File be/src/runtime/io/disk-file.h: http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@52 PS5, Line 52: > Need a description for MemBlock class. Done http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@62 PS5, Line 62: > nit. Return whether the memory is reserved or allocated before deletion to Done http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@68 PS5, Line 68: > nit. may switch the order of the two arguments as the lock appears first in Done http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@168 PS5, Line 168: /// The status of the memory block. > nit. May describe the content of a memory block a little bit. For example, Done http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@200 PS5, Line 200: DiskFile(const std::string& path, DiskIoMgr* io_mgr, int64_t file_size, > nit missing the comment. Done http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@206 PS5, Line 206: > Since this class is nested within DiskFile class, suggest to call it ReadBu Done http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@206 PS5, Line 206: : virtual ~DiskFile() {} > A little bit confusing on how these two names relate to mem blocks. See my Previously, a read buffer and a memory block are the same thing in our patch. Changed to name it as "read buffer block", because compared to "memory block", the TmpFileMgr may be easier to know what the block is used for by its name. And the block may not be necessarily in memory, at first we think to have it in disk, the name could be easier to extend if later we have some different implementation. Added some comments, hope to make it clearer. http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@210 PS5, Line 210: page, each time > Suggest to use a name relate to memory block, say mem_block_size_? Changed to read_buffer_block_size_. http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@213 PS5, Line 213: ilable or not. > suggest to rename to num_mem_blocks_. Changed to num_of_read_buffer_blocks_. http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@326 PS5, Line 326: id UpdateReadBufferMetaData(int64_t offset) { > I saw the IF check below which should take care of index being 12. Since th It depends on the file size. For example, if the file size is 128MB, block size is 8MB. Then there will be 128/8 = 16 mem blocks per file. Therefore, there is no limitation that to be only 4 blocks per file, the number of blocks is decided by default file size and default block size. In this case, the last index is 15 for offset over 120MB. There could be a case that the actual file size is over the default one a little bit, but it is assumed no more than a page size. For example, it could be 129MB for the file, therefore, the actual size of the last block can be 9MB, but it is not a huge difference compared to other blocks. The number of read blocks is calculated here: https://gerrit.cloudera.org/#/c/17979/6/be/src/runtime/tmp-file-mgr.cc@411 http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.cc File be/src/runtime/io/disk-file.cc: http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.cc@116 PS5, Line 116: > Since the memory is actually deleted in this method, maybe we should call t Think for a while, maybe DISABLED is better, because it means the block doesn't allow to use anymore. DELETED may mean sth has been deleted, but in our case, it doesn't necessarily delete anything for being the state. http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc File be/src/runtime/io/disk-io-mgr.cc: http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@396 PS5, Line 396: // Fetch the data from the source file (remote) to the destination file (local). : DCHECK(disk_file_dst_ != nullptr); > May need to add a comment to describe these two files. An example would be Done http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@424 PS5, Line 424: read_buffer_bloc, MemBlockStatus::DISABLED, dstl, &read_buffer_lock)) { > Why can't we use this DISABLED memory block here? Looks like we should not Because the memory of the block is released, if this case happens, we can't use the memory block unless we can reserve the memory again. But we don't allow the block status to go back from DISABLED to RESERVED, and the case when DISABLED happens, it could be that the query is finished or cancelled, or all the pages of the block are read, all cases should have no need to fetch anything anymore. Added some comments. http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@436 PS5, Line 436: atus = Status(TErrorCode::DISK_IO_ERROR, GetBackendStri > This is not used. The read timer works when it is destructed. It counts the time interval during its life cycle. http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@445 PS5, Line 445: queue->read_size()->Update(local_file_size); : disk_file_dst_->SetReadBufferBlockStatus( > Should we set the memory block allocated above to DISABLED here so that it Do you mean delete the memory block if fails? Right now it is handled by the callback function in TmpFile, because the deletion of a block needs to add some number to the TmpFileMgr metrics according to different block status, and the DiskIoMgr isn't aware of TmpFileMgr's structure. So I guess the first question is the same that the callback function would set the block to DISABLED if fails. The failure handle is here: https://gerrit.cloudera.org/#/c/17979/6/be/src/runtime/tmp-file-mgr.cc@1040 Added some comments. http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/tmp-file-mgr.h File be/src/runtime/tmp-file-mgr.h: http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/tmp-file-mgr.h@147 PS5, Line 147: read_buffer_block_size_; > nit. This name sounds like the buffer is remote. Maybe renamed as mem_block Changed to read_buffer_block_size_. http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/tmp-file-mgr.h@151 PS5, Line 151: num_read_buffer_blocks_per_file_; > similar comment as above. num_mem_blocks_per_file instead? Changed to num_read_buffer_blocks_per_file_. http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/tmp-file-mgr.h@158 PS5, Line 158: max_read_buffer_size_; > max_mem_block_size_? Changed to max_read_buffer_size_. -- To view, visit http://gerrit.cloudera.org:8080/17979 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1dcc5d0881ffaeff09c5c514306cd668373ad31b Gerrit-Change-Number: 17979 Gerrit-PatchSet: 6 Gerrit-Owner: Yida Wu <wydbaggio...@gmail.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com> Gerrit-Comment-Date: Tue, 09 Nov 2021 17:44:02 +0000 Gerrit-HasComments: Yes