Tim Armstrong has uploaded a new change for review. http://gerrit.cloudera.org:8080/3518
Change subject: IMPALA-3780: avoid many small reads past end of block
......................................................................

IMPALA-3780: avoid many small reads past end of block

The text scanner had some pathological behaviour when reading
significantly past the end of its scan range, e.g. when reading a
256MB string that is split across blocks.

ScannerContext defaulted to issuing 1KB reads, even if the scan node
requested significantly more data. E.g. if the Parquet scanner called
ReadBytes(16MB), this was chopped up into 1KB reads, which were
reassembled in boundary_buffer_. Increase the minimum read size in
this case to 64KB. Reading that amount of data should not have any
significant overhead even if we only read a few bytes past the end of
the scan range.

ScannerContext now implements a saner default algorithm that will work
better if scanners make many small reads: it starts with 64KB reads
and doubles the size of each successive read past the end of the scan
range.

We also correctly pass the 'read_past_size' into GetNextBuffer(), so
that we always read the right amount of data.

Also save some time by pre-sizing the boundary buffer to the correct
size instead of reallocating it multiple times.

Testing:
Add test case that exercises the code paths for very large strings.

Performance:
The queries in the test case are vastly faster than before, e.g. 0.6s
versus ~60s for the count(*) query.
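
For readers without the diff at hand, the read-size policy described above
can be sketched roughly as follows. This is an illustration only, not the
code in scanner-context.cc: the class name ReadPastSizePolicy, the method
NextReadSize(), and the 8MB cap on the doubled default are all assumptions
made for this sketch. Only the 64KB starting size and the doubling come
from the commit message.

```cpp
#include <algorithm>
#include <cstdint>

// 64KB initial read size, as stated in the commit message.
constexpr int64_t kInitialReadPastSize = 64 * 1024;
// Assumed upper bound on the doubled default; the commit message gives none.
constexpr int64_t kMaxDefaultReadPastSize = 8 * 1024 * 1024;

// Hypothetical helper (not Impala's actual class): chooses the size of each
// successive read issued past the end of the scan range.
class ReadPastSizePolicy {
 public:
  // 'requested_len' is the number of bytes the scanner asked for.
  // 'file_bytes_remaining' is the number of bytes left before end of file.
  // Returns the number of bytes to read next.
  int64_t NextReadSize(int64_t requested_len, int64_t file_bytes_remaining) {
    // Never read less than the scanner asked for, and never less than the
    // current default, so a large request is not chopped into tiny reads.
    int64_t size = std::max(requested_len, next_default_);
    // Do not read past the end of the file.
    size = std::min(size, file_bytes_remaining);
    // Double the default for the next read, up to the assumed cap.
    next_default_ = std::min(next_default_ * 2, kMaxDefaultReadPastSize);
    return size;
  }

 private:
  int64_t next_default_ = kInitialReadPastSize;
};
```

Under these assumptions, a scanner making many small requests sees reads of
64KB, 128KB, 256KB and so on, while a single 16MB request is served by one
16MB read rather than thousands of 1KB reads.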

Change-Id: Id90c5dea44f07dba5dd465cf325fbff28be34137
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/string-buffer.h
M tests/query_test/test_insert.py
6 files changed, 93 insertions(+), 40 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/18/3518/1
--
To view, visit http://gerrit.cloudera.org:8080/3518
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Id90c5dea44f07dba5dd465cf325fbff28be34137
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Tim Armstrong <[email protected]>
