Yida Wu has posted comments on this change. ( http://gerrit.cloudera.org:8080/17979 )
Change subject: IMPALA-10791 Add batching reading for remote temporary files ...................................................................... Patch Set 4: (9 comments) http://gerrit.cloudera.org:8080/#/c/17979/1/be/src/runtime/io/disk-file.h File be/src/runtime/io/disk-file.h: http://gerrit.cloudera.org:8080/#/c/17979/1/be/src/runtime/io/disk-file.h@195 PS1, Line 195: > Can we use MemBlockState state_ here? Because the naming in DiskFile is "file_status_", maybe just keep them the same, otherwise it may be good to change all of them, including the interface names, but should be some work. http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/request-context.cc File be/src/runtime/io/request-context.cc: http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/request-context.cc@201 PS3, Line 201: unstarted_remote_file_op_ranges_; > Maybe named to unstarted_remote_file_op_ranges_? Done http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/request-ranges.h File be/src/runtime/io/request-ranges.h: http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/request-ranges.h@118 PS3, Line 118: WRITE, > May need to explain what it is. Done http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/request-ranges.h@702 PS3, Line 702: > nit. upload the file to a remote location. Done http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/request-ranges.h@708 PS3, Line 708: > nit. the fetch file operation from a remote site. Done http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/scan-range.cc File be/src/runtime/io/scan-range.cc: http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/scan-range.cc@171 PS3, Line 171: the range : // is supposed to be read in one round. > Suggest to remove as there is no guarantee. Changed to "supposed". http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/scan-range.cc@178 PS3, Line 178: read_status > need to check the status. Done http://gerrit.cloudera.org:8080/#/c/17979/2/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/17979/2/be/src/runtime/tmp-file-mgr.cc@257 PS2, Line 257: Status setup_read_buffer_status = SetUpReadBufferParams(); : if (!setup_read_buffer_status.ok()) { > If handling the rare case is simple task, I feel we should do so. Changed. If the file size is smaller than the max block size, set the block size as file size. Otherwise block size is the max block size, which is 16MB. http://gerrit.cloudera.org:8080/#/c/17979/2/be/src/runtime/tmp-file-mgr.cc@1039 PS2, Line 1039: read_buffer_block->NotifyAllWaits(); > In practice, the read buffer memory is always full during the big queries ( Have a simple test today (15x tpcds, q67, c5d.4xlarge 16u32g, 1G read buffer). 1. Disabled is set: (Time: 135s) (Data Read: 13.8GB) 2. Disabled is not set: (Time: 150s) (Data Read: 17.7GB) As expected, if the disabled is not set, performance is worse because more data is read (more duplicated read). It could be a little different for other queries, but if the read buffer is not available (full) for most of the time, which is quite likely when spilling large amount of data, disabling the file from batching read when failing to reserve space could be a better solution. I think the next optimization is to make the read buffer more available, maybe using a better eviction policy. -- To view, visit http://gerrit.cloudera.org:8080/17979 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1dcc5d0881ffaeff09c5c514306cd668373ad31b Gerrit-Change-Number: 17979 Gerrit-PatchSet: 4 Gerrit-Owner: Yida Wu <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Tue, 02 Nov 2021 01:05:56 +0000 Gerrit-HasComments: Yes
