Riza Suminto has uploaded a new patch set (#22) to the change originally 
created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin by iterating over the column streams. If a column stream is
eligible for async IO, we create a ColumnRange, register a
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
read continues through HdfsOrcScanner::ColumnRange::read.
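
The dispatch described above can be sketched in Python as follows. This
is an illustration only, not the Impala C++ code: the ColumnRange class
here is a simplified stand-in, and covers(), find_column_range(), and
the sync_read callback are hypothetical names. It also assumes that a
request "matches" when it lies fully inside a registered range, which
the patch description does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ColumnRange:
    """Hypothetical stand-in for HdfsOrcScanner::ColumnRange."""
    offset: int   # file offset where the ORC stream starts
    length: int   # stream length in bytes
    data: bytes   # bytes delivered by the async stream (simulated here)

    def covers(self, offset: int, length: int) -> bool:
        # Assumption: a request matches when it lies fully inside the range.
        return (self.offset <= offset
                and offset + length <= self.offset + self.length)

    def read(self, offset: int, length: int) -> bytes:
        # Translate the file offset into an offset within this range.
        start = offset - self.offset
        return self.data[start:start + length]


def find_column_range(ranges: List[ColumnRange], offset: int,
                      length: int) -> Optional[ColumnRange]:
    """Return the first registered range covering the request, if any."""
    for r in ranges:
        if r.covers(offset, length):
            return r
    return None


def read(ranges: List[ColumnRange], offset: int, length: int,
         sync_read: Callable[[int, int], bytes]) -> bytes:
    """Serve the request from a matching ColumnRange, else read synchronously."""
    r = find_column_range(ranges, offset, length)
    if r is not None:
        return r.read(offset, length)
    return sync_read(offset, length)
```

Requests that match no registered range fall back to the synchronous
path, so streams that are not eligible for async IO keep working
unchanged in this sketch.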

We reuse the existing async IO methods from the HdfsParquetScanner class
for initial memory allocation. We moved the related methods, such as
DivideReservationBetweenColumns and ComputeIdealReservation, up to the
HdfsColumnarScanner class.

The planner calculates the memory reservation differently for async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide it as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB reservations because it might use
dictionary encoding with four streams (PRESENT, DATA, DICTIONARY_DATA,
and LENGTH). This splitting is required because each async IO stream
needs to start with an 8KB (min_buffer_size) initial memory reservation.
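
The arithmetic of the split can be sketched in Python as follows. This
is not the planner code in HdfsScanNode.java: the function name
split_column_reservation is hypothetical, the stream count is passed in
by the caller, and flooring each share at min_buffer_size is an
assumption based on the 8KB requirement stated above.

```python
from typing import List

MIN_BUFFER_SIZE = 8 * 1024   # 8KB, the min_buffer_size mentioned above
MB = 1024 * 1024


def split_column_reservation(column_bytes: int, num_streams: int,
                             min_buffer_size: int = MIN_BUFFER_SIZE) -> List[int]:
    """Split one column's memory estimate evenly across its ORC streams.

    Assumption: no stream gets less than min_buffer_size, because each
    async IO stream needs that much to start.
    """
    if num_streams <= 0:
        raise ValueError("num_streams must be positive")
    per_stream = max(column_bytes // num_streams, min_buffer_size)
    return [per_stream] * num_streams


# The string column example from the text: 4MB across four streams
# (PRESENT, DATA, DICTIONARY_DATA, LENGTH) gives 1MB per stream.
string_column = split_column_reservation(4 * MB, 4)
```

With a very small estimate, such as 16KB across four streams, this
sketch raises each share to 8KB, so the total can exceed the original
column estimate.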

To show the improvement from ORC async IO, we compare the total time
and geomean (in milliseconds) of a full TPC-DS run at 10 TB scale on 19
executors, with varying ORC_ASYNC_READ and DISABLE_DATA_CACHE options as
follows:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+
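
The relative improvement implied by the two tables can be derived with
a short Python sketch. The numbers are copied from the tables above;
the helper names percent_reduction and summarize are hypothetical. The
result is a reduction of roughly 17% in total time and 28% in geomean
with the data cache disabled, and under 3% for both with it enabled.

```python
# (ORC_ASYNC_READ=0, ORC_ASYNC_READ=1) in milliseconds, keyed by
# the DISABLE_DATA_CACHE setting. Values are taken from the tables above.
RESULTS = {
    "total_time": {0: (3511075, 3484736), 1: (5243337, 4370095)},
    "geomean": {0: (12786.58042, 12454.80365), 1: (23081.10888, 16692.31512)},
}


def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


def summarize(results=RESULTS):
    """Map (metric, DISABLE_DATA_CACHE) to the percent reduction from async IO."""
    return {
        (metric, cache_disabled): percent_reduction(sync_ms, async_ms)
        for metric, by_cache in results.items()
        for cache_disabled, (sync_ms, async_ms) in by_cache.items()
    }


if __name__ == "__main__":
    for (metric, cache_disabled), pct in summarize().items():
        print(f"{metric:10s} DISABLE_DATA_CACHE={cache_disabled}: {pct:.2f}% faster")
```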

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 552 insertions(+), 229 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/22
--
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
