Hello Jean-Daniel Cryans, Adar Dembo,
I'd like you to do a code review. Please visit
http://gerrit.cloudera.org:8080/5689
to review the following change.
Change subject: tpch: improve encodings and compression
......................................................................
tpch: improve encodings and compression
Previously all of the columns had been hard-coded to 'PLAIN' encoding.
This is no longer our default, nor would we recommend it for the types
of data used in the TPCH dataset.
This switches to default encodings everywhere, and also enables LZ4
compression on the 'comment' column.
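As a rough illustration of the kind of change described above, here is a minimal sketch using the Kudu C++ client's schema builder. The table, the column names, and the function BuildExampleSchema are hypothetical and are not copied from tpch-schemas.h; the sketch assumes the Kudu client headers are installed. Omitting the Encoding() call leaves the choice of encoding to Kudu's per-type default, and Compression() opts a single column into LZ4.

```cpp
#include <kudu/client/client.h>  // assumed install path for the Kudu C++ client

using kudu::client::KuduColumnSchema;
using kudu::client::KuduColumnStorageAttributes;
using kudu::client::KuduSchema;
using kudu::client::KuduSchemaBuilder;

// Hypothetical two-column table for illustration only; this is not the
// actual TPC-H lineitem schema from tpch-schemas.h.
kudu::Status BuildExampleSchema(KuduSchema* schema) {
  KuduSchemaBuilder b;

  // No Encoding() call here, so the column uses the default encoding for
  // its type instead of a hard-coded
  // ->Encoding(KuduColumnStorageAttributes::PLAIN_ENCODING).
  b.AddColumn("l_orderkey")
      ->Type(KuduColumnSchema::INT64)
      ->NotNull()
      ->PrimaryKey();

  // Default encoding plus LZ4 block compression for the free-text column.
  b.AddColumn("l_comment")
      ->Type(KuduColumnSchema::STRING)
      ->NotNull()
      ->Compression(KuduColumnStorageAttributes::LZ4);

  return b.Build(schema);
}
```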
The effect on data size and scan time is as follows:
original:
  size: 993MB
  median scan time for TPCH1 query: 0.8685 sec
with LZ4 'comment':
  size: 901MB (1.1x compression vs original)
  scan time: unaffected (query does not read comment column)
with LZ4 'comment' and new encodings:
  size: 342MB (2.9x compression vs original)
  median scan time for TPCH1 query: 0.8488 sec
Per the above, the on-disk size is reduced by almost 3x and the median
scan time improves by about 2% (perhaps within the realm of measurement
error). This workload is small enough to be fully RAM-resident, but on a
larger dataset that is disk-bound on reads, the space reduction should
yield a corresponding improvement in scan performance.
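The ratios quoted above can be re-derived from the reported sizes and scan times. The following is a small sketch of that arithmetic; the helper names CompressionRatio and PercentReduction and the constant names are made up for this example, and the numeric inputs are the figures from this commit message.

```cpp
// Figures reported in this commit message.
const double kOriginalMb = 993.0;
const double kLz4CommentMb = 901.0;
const double kLz4PlusEncodingsMb = 342.0;
const double kOriginalScanSec = 0.8685;
const double kNewScanSec = 0.8488;

// Ratio of the original size to the new size (larger is better).
double CompressionRatio(double original_mb, double new_mb) {
  return original_mb / new_mb;
}

// Relative reduction from `before` to `after`, in percent.
double PercentReduction(double before, double after) {
  return (before - after) / before * 100.0;
}

// CompressionRatio(kOriginalMb, kLz4CommentMb)        is roughly 1.10
// CompressionRatio(kOriginalMb, kLz4PlusEncodingsMb)  is roughly 2.90
// PercentReduction(kOriginalScanSec, kNewScanSec)     is roughly 2.3
```

The roughly 2.3% scan-time difference is small relative to the 2.9x size reduction, which is consistent with the workload being RAM-resident rather than disk-bound.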
Change-Id: I168eb1c4ff619556f6879a20fe335a6158d0e81b
---
M src/kudu/benchmarks/tpch/tpch-schemas.h
1 file changed, 9 insertions(+), 8 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/89/5689/1
--
Gerrit-MessageType: newchange
Gerrit-Change-Id: I168eb1c4ff619556f6879a20fe335a6158d0e81b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans <[email protected]>