Hello Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/14804
to look at the new patch set (#3).
Change subject: IMPALA-9174: Speedup allocations of ORC Scanner
......................................................................
IMPALA-9174: Speedup allocations of ORC Scanner
The ORC library provides a hook for its clients to plug in a custom
memory pool. This memory pool needs to override the 'malloc()' and
'free()' methods. Impala has its own memory pool named OrcMemPool.
Impala's OrcMemPool used to have an internal unordered_map to keep
track of its allocations. In 'free()' it used the map to look up the
size of the allocated byte array. We need this information in 'free()'
because of memory tracking. As a result, every 'malloc()' and 'free()'
call incurred an additional allocation/deallocation by the unordered_map.
malloc implementations already track the size of every allocation,
classically in a header in front of the allocated bytes. That's why
'free()' doesn't take a 'size' parameter: the allocator recovers the size
from its own bookkeeping. TCMalloc provides a function called
'tc_malloc_size()' that exposes that information programmatically, so
instead of the hash table we can just call this function to retrieve
the size.
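The idea can be sketched as follows. This is a minimal illustration and not the code in the patch: 'TrackedPool' and 'bytes_in_use()' are hypothetical names, and glibc's 'malloc_usable_size()' is used as a stand-in for 'tc_malloc_size()' so that the sketch builds without gperftools.

```cpp
#include <malloc.h>  // malloc_usable_size() (glibc); stand-in for tc_malloc_size()
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical pool that does memory tracking without a side table.
class TrackedPool {
 public:
  char* Malloc(uint64_t size) {
    char* p = static_cast<char*>(std::malloc(size));
    if (p == nullptr) return nullptr;
    // Track what the allocator actually handed out, which may exceed 'size'.
    bytes_in_use_ += malloc_usable_size(p);
    return p;
  }

  void Free(char* p) {
    if (p == nullptr) return;
    // Ask the allocator for the size instead of looking it up in a map.
    const uint64_t size = malloc_usable_size(p);
    assert(bytes_in_use_ >= size);
    bytes_in_use_ -= size;
    std::free(p);
  }

  uint64_t bytes_in_use() const { return bytes_in_use_; }

 private:
  uint64_t bytes_in_use_ = 0;
};
```

Because the same query is made in both 'Malloc()' and 'Free()', the counter stays balanced even when the allocator rounds a request up to a larger size.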
OrcMemPool also had a method called 'FreeAll()' which freed all
allocated memory. In practice this was a no-op because the library only
uses the memory pool to allocate memory for the data buffers, and they
free their memory in their destructors. It did, however, act as a guard
against memory leaks in the ORC library. We can add some checks to
the destructor of OrcMemPool to detect leaks if we don't trust the
library's memory management.
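One possible shape for such a destructor check is sketched below. It is an assumption-laden illustration, not part of this patch: 'LeakCheckedPool' and the 'leaked_out' pointer are hypothetical, and 'malloc_usable_size()' again stands in for 'tc_malloc_size()'. A real implementation would more likely log through Impala's own facilities or use a debug-only check.

```cpp
#include <malloc.h>  // malloc_usable_size() (glibc); stand-in for tc_malloc_size()
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical pool that reports outstanding bytes when it is destroyed.
class LeakCheckedPool {
 public:
  // 'leaked_out' is optional and receives the outstanding byte count
  // at destruction; it exists only to make the sketch observable.
  explicit LeakCheckedPool(uint64_t* leaked_out = nullptr)
      : leaked_out_(leaked_out) {}

  ~LeakCheckedPool() {
    if (leaked_out_ != nullptr) *leaked_out_ = bytes_in_use_;
    if (bytes_in_use_ != 0) {
      // Report only; the pool no longer knows the individual pointers.
      std::fprintf(stderr, "pool destroyed with %llu bytes outstanding\n",
                   static_cast<unsigned long long>(bytes_in_use_));
    }
  }

  char* Malloc(uint64_t size) {
    char* p = static_cast<char*>(std::malloc(size));
    if (p != nullptr) bytes_in_use_ += malloc_usable_size(p);
    return p;
  }

  void Free(char* p) {
    if (p == nullptr) return;
    bytes_in_use_ -= malloc_usable_size(p);
    std::free(p);
  }

 private:
  uint64_t* leaked_out_;
  uint64_t bytes_in_use_ = 0;
};
```

Unlike 'FreeAll()', this check can only detect a leak, not repair it, since without the map the pool has no record of which pointers are still live.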
Performance
I ran single_node_perf_run.py to measure the performance gain:
TPCH scale 5:
+----------+-------------------+---------+------------+------------+----------------+
| Workload | File Format       | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-------------------+---------+------------+------------+----------------+
| TPCH(5)  | orc / def / block | 3.73    | -0.62%     | 2.95       | -1.22%         |
+----------+-------------------+---------+------------+------------+----------------+
(R) Regression: TPCH(5) TPCH-Q18 [orc / def / block] (5.22s -> 5.58s [+6.92%])
(I) Improvement: TPCH(5) TPCH-Q2 [orc / def / block] (1.62s -> 1.44s [-11.26%])
(I) Improvement: TPCH(5) TPCH-Q6 [orc / def / block] (1.33s -> 1.14s [-14.68%])
TPCH scale 10:
+----------+-------------------+---------+------------+------------+----------------+
| Workload | File Format       | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-------------------+---------+------------+------------+----------------+
| TPCH(10) | orc / def / block | 6.57    | -0.64%     | 4.93       | -0.94%         |
+----------+-------------------+---------+------------+------------+----------------+
(I) Improvement: TPCH(10) TPCH-Q20 [orc / def / block] (5.94s -> 5.49s [-7.54%])
(I) Improvement: TPCH(10) TPCH-Q13 [orc / def / block] (5.33s -> 4.70s [-11.84%])
Change-Id: Ia09e746883176d6f955c1718267bf55e2abb239b
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
2 files changed, 35 insertions(+), 32 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/14804/3
--
To view, visit http://gerrit.cloudera.org:8080/14804
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia09e746883176d6f955c1718267bf55e2abb239b
Gerrit-Change-Number: 14804
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>