Yida Wu has posted comments on this change. ( http://gerrit.cloudera.org:8080/17979 )
Change subject: IMPALA-10791 Add batch reading for remote temporary files ...................................................................... Patch Set 12: (11 comments) http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.h File be/src/runtime/io/disk-file.h: http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.h@55 PS12, Line 55: class MemBlock { > This seems like a fairly leaky abstraction. It mostly exists to handle the It is quite a simple class working closely with the DiskFile and TmpFile stuff, may rely on the TmpFile stuff to reserve the memory (which relies on a global variable in TmpFileMgr) for it before having the right to allocate, and rely on the IO functions to write and read directly on the data_ under certain block status. Think for a while for a better abstraction, but dont have much idea for now while guarantee the efficiency and safety without too much complexity. I think one of the good thing here is the status transition is clear and one-way down, what the user needs to do, in most cases, is to lock on the block spinlock, do something, change the status if needed. http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.h@220 PS12, Line 220: std::unique_ptr<int64_t[]> page_cnts_per_block_; > This seems a little funny, but I think I see why it's necessary. A const ve Maybe personal preference. Prefer a fixed-size array if the size is fixed and type is simple. Looks more clear. http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.h@446 PS12, Line 446: // Helper function to check the status of a read buffe block. > nit: buffe -> buffer Done http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.h@458 PS12, Line 458: // Helper function to delete the read buffe block. > nit: buffe -> buffer Done http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.h@516 PS12, Line 516: /// The lock also protects the memory blocks from destruction, if the disk file has. > "if the disk file has" seems like an incomplete sentence. Done http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.cc File be/src/runtime/io/disk-file.cc: http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.cc@99 PS12, Line 99: void MemBlock::Delete(bool* reserved, bool* allocated) { > Some unit tests around MemBlock transitions might be useful. Added a unit testcase "MemBlockTest" should have covered the MemBlock status transitions. http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/request-ranges.h File be/src/runtime/io/request-ranges.h: http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/request-ranges.h@152 PS12, Line 152: RequestRange(RequestType::type request_type, int disk_id = -1, int64_t offset = -1) > How does an offset of -1 differ from 0? I think the change is to allow assigning the offset in the constructor function. -1 is an invalid value for the offset, that means when we need to set the offset of the range before using the range, there are some assertion about it like "DCHECK_GE(offset, 0);" when using it. http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/scan-range.cc File be/src/runtime/io/scan-range.cc: http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/scan-range.cc@156 PS12, Line 156: if (!use_mem_buffer) { > Since we asserted above that use_local_buff implies !use_mem_buffer, this c Added some comments. The logic here is to set the correct file reader if it involves reading from the file system (could be local or remote file). If it uses memory buffer, it doesn't need to set this. There is a case if use_mem_buffer and use_local_buff are all false, this case we will use the original file_reader_ to get the range. http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/scan-range.cc@291 PS12, Line 291: // 1. If it is the local buffer file is not deleted(evicted) yet. > Change to "If the local buffer file ..." Done http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/tmp-file-mgr-test.cc File be/src/runtime/tmp-file-mgr-test.cc: http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/tmp-file-mgr-test.cc@1835 PS12, Line 1835: int64_t file_size = 512 * 1024 * 1024; > Why was this changed? The maximum allowed file size for one remote file increases to 512MB in this patch defined in MAX_REMOTE_TMPFILE_SIZE_THRESHOLD_MB in tmp-file-mgr.cc. The reason for the increase is a bigger file size may have a better upload performance. By default we are still using 256MB, but may give users a chance if they would like to try a bigger size. http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/tmp-file-mgr.cc@117 PS12, Line 117: "Set if the system uses batch reading for the remote temporary files."); > This description could be more descriptive. Maybe "Set to prefetch addition Added some description. But the prefetch is only in part 2, will add the prefetch description in part 2. -- To view, visit http://gerrit.cloudera.org:8080/17979 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1dcc5d0881ffaeff09c5c514306cd668373ad31b Gerrit-Change-Number: 17979 Gerrit-PatchSet: 12 Gerrit-Owner: Yida Wu <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Tue, 28 Jun 2022 00:46:03 +0000 Gerrit-HasComments: Yes
