Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#10).

Change subject: IMPALA-4123: Columnar decoding in Parquet
......................................................................

IMPALA-4123: Columnar decoding in Parquet

These changes should enable further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression.
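
As a rough illustration of what such a kernel looks like, here is a minimal
sketch of batched dictionary decoding. It is not the code in this patch:
UnpackAndDecodeBatch() and its signature are hypothetical, and the only
assumption taken from Parquet is that indices are bit-packed LSB-first. The
point is that the per-value work is a tight loop with no virtual calls or
per-row state.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Impala's UnpackAndDecode32Values()): unpacks
// 'num_values' dictionary indices of 'bit_width' bits each, packed LSB-first,
// and writes the corresponding dictionary entries to 'out'.
// Returns false if the input buffer is too short, the bit width is
// unsupported, or an index is out of range for 'dict'.
template <typename T>
bool UnpackAndDecodeBatch(const uint8_t* in, size_t in_bytes, int bit_width,
                          const std::vector<T>& dict, size_t num_values, T* out) {
  if (bit_width < 0 || bit_width > 32) return false;
  if ((num_values * bit_width + 7) / 8 > in_bytes) return false;
  const uint64_t mask = (bit_width == 0) ? 0 : ((1ULL << bit_width) - 1);
  size_t bit_pos = 0;
  for (size_t i = 0; i < num_values; ++i, bit_pos += bit_width) {
    // Gather up to 5 bytes: enough for a 32-bit value at any bit offset.
    const size_t byte = bit_pos / 8;
    uint64_t word = 0;
    for (size_t b = 0; b < 5 && byte + b < in_bytes; ++b) {
      word |= static_cast<uint64_t>(in[byte + b]) << (8 * b);
    }
    const uint32_t idx = static_cast<uint32_t>((word >> (bit_pos % 8)) & mask);
    if (idx >= dict.size()) return false;  // Corrupt or mismatched dictionary.
    out[i] = dict[idx];
  }
  return true;
}
```

A caller would invoke this once per batch of values rather than once per
row, which is what makes the loop a candidate for further optimization.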

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
TPC-H at scale factor 60 showed a ~4% overall speedup on both
uncompressed and snappy-compressed Parquet.

Microbenchmarks show that scans that only do dictionary decoding on
uncompressed Parquet are ~75% faster, e.g.:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypesagg with a mix of null and non-null values.

Many tables have long runs of non-null values.

Added new test data and coverage:
* A test table manynulls with long runs of null values.
* A large CHAR test table.
* Missing coverage for materialising the pos slot in a flattened nested
  types scan.
* Extended the dict test to cover longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc., particularly runs
  of collections with the same shape.
* A test for the interaction of timestamp validation with conversion.
* Ran a code coverage build to confirm all code paths are tested.

TODO:
* Before merging, run the fuzz test for longer.
* Do ASAN and exhaustive runs.

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
22 files changed, 705 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/10
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
