Tim Armstrong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3518

Change subject: IMPALA-3780: avoid many small reads past end of block
......................................................................

IMPALA-3780: avoid many small reads past end of block

The text scanner had some pathological behaviour when reading
significantly past the end of its scan range. E.g. reading a 256mb string
that's split across blocks. ScannerContext defaulted to issuing 1kb
reads, even if the scan node requested significantly more data. E.g. if
the Parquet scanner called ReadBytes(16mb), this was chopped up into
1kb reads, which were reassembled in boundary_buffer_.

Increase the minimum read size in this case to 64kb. Reading that amount
of data should not have any significant overhead even if we only read
a few bytes past the end of the scan range.

ScannerContext implements a saner default algorithm that will work better
if scanners make many small reads: it starts with 64kb reads and doubles
the size of each successive read past the end of the scan range. We
also correctly pass the 'read_past_size' into GetNextBuffer(), so that
we always read the right amount of data.
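
As a rough illustration of the doubling policy described above, here is a
minimal sketch. It is not the code in scanner-context.cc: the names
NextReadPastSize and kInitialReadPastSize are hypothetical, and taking the
maximum of the doubled size and the scanner's requested size is an
assumption about how the two are combined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical constant: the 64kb starting size comes from the commit
// message, the name does not.
constexpr int64_t kInitialReadPastSize = 64 * 1024;

// Hypothetical helper, not Impala's actual API.
// 'num_prior_reads' is how many reads have already been issued past the end
// of the scan range; 'requested_bytes' is what the scanner asked for.
// Returns the size to use for the next read past the end of the scan range.
int64_t NextReadPastSize(int num_prior_reads, int64_t requested_bytes) {
  assert(num_prior_reads >= 0);
  assert(requested_bytes >= 0);
  // Cap the shift so the doubled size cannot overflow int64_t.
  const int shift = std::min(num_prior_reads, 30);
  const int64_t doubled = kInitialReadPastSize << shift;
  // Assumption: never read less than the scanner explicitly requested.
  return std::max(doubled, requested_bytes);
}
```

Under these assumptions, a scanner that makes many small reads gets 64kb,
128kb, 256kb, and so on, while a single ReadBytes(16mb) request is served by
one 16mb read rather than thousands of 1kb reads.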

Also save some time by pre-sizing the boundary buffer to the correct
size instead of reallocating it multiple times.
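
The pre-sizing idea can be sketched as follows. This uses std::string as a
stand-in for Impala's StringBuffer, and AssembleChunks is a hypothetical
helper written only for illustration; the real change in string-buffer.h
and scanner-context.cc may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for assembling a value that spans several I/O
// buffers. std::string plays the role of the boundary buffer here.
std::string AssembleChunks(const std::vector<std::string>& chunks) {
  // First pass: compute the final size so the buffer can be sized once.
  std::size_t total = 0;
  for (const std::string& chunk : chunks) total += chunk.size();

  std::string out;
  out.reserve(total);  // Ensure capacity up front so the appends below need not reallocate.

  // Second pass: copy each chunk into the pre-sized buffer.
  for (const std::string& chunk : chunks) out.append(chunk);
  assert(out.size() == total);
  return out;
}
```

Without the reserve() call, the buffer would typically be regrown and copied
several times as chunks are appended, which becomes noticeable for values in
the hundreds of megabytes.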

Testing:
Add test case that exercises the code paths for very large strings.

Performance:
The queries in the test case are vastly faster than before. E.g. 0.6s
versus ~60s for the count(*) query.

Change-Id: Id90c5dea44f07dba5dd465cf325fbff28be34137
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/string-buffer.h
M tests/query_test/test_insert.py
6 files changed, 93 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/18/3518/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3518
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Id90c5dea44f07dba5dd465cf325fbff28be34137
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Tim Armstrong <[email protected]>
