[
https://issues.apache.org/jira/browse/IMPALA-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Norbert Luksa resolved IMPALA-9226.
-----------------------------------
Fix Version/s: Impala 4.0
Resolution: Fixed
> Improve string allocations of the ORC scanner
> ---------------------------------------------
>
> Key: IMPALA-9226
> URL: https://issues.apache.org/jira/browse/IMPALA-9226
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Zoltán Borók-Nagy
> Assignee: Norbert Luksa
> Priority: Major
> Labels: orc
> Fix For: Impala 4.0
>
>
> Currently the ORC scanner allocates new memory for each string value (except
> for fixed-size strings):
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172]
> Besides causing many small allocations and copies, this is also bad for
> memory locality.
> Since ORC-501, StringVectorBatch has a member named 'blob' that contains the
> strings in the batch:
> [https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126]
> 'blob' has type DataBuffer, which is movable, so Impala might be able to take
> ownership of it. Or, at least, we could copy the whole blob array in one go
> instead of copying the strings one by one.
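
The "copy the whole blob" idea above can be sketched as follows. This is only an illustration under stated assumptions: MockStringBatch is a hypothetical stand-in that mimics the data/length/blob layout described above using std::vector, StringRef is a hypothetical pointer-plus-length view, and CopyBlobOnce and MakeBatch are made-up helper names. None of this is the actual ORC or Impala code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a string batch whose values live in one blob.
struct MockStringBatch {
  std::vector<char> blob;       // all string bytes, stored back to back
  std::vector<char*> data;      // data[i] points at the start of string i in blob
  std::vector<int64_t> length;  // length[i] is the byte length of string i
};

// Hypothetical non-owning string view (pointer + length).
struct StringRef {
  const char* ptr;
  int64_t len;
};

// Test helper: packs 'values' into a batch with a single contiguous blob.
MockStringBatch MakeBatch(const std::vector<std::string>& values) {
  MockStringBatch batch;
  for (const std::string& v : values) {
    batch.blob.insert(batch.blob.end(), v.begin(), v.end());
    batch.length.push_back(static_cast<int64_t>(v.size()));
  }
  // Set the pointers only after the blob has reached its final size,
  // because growing the vector may move its storage.
  char* cursor = batch.blob.data();
  for (int64_t len : batch.length) {
    batch.data.push_back(cursor);
    cursor += len;
  }
  return batch;
}

// Copies the whole blob into 'pool' with one allocation, then rebases every
// per-string pointer onto the copy instead of copying the strings one by one.
std::vector<StringRef> CopyBlobOnce(const MockStringBatch& batch,
                                    std::vector<char>* pool) {
  assert(batch.data.size() == batch.length.size());
  pool->assign(batch.blob.begin(), batch.blob.end());
  std::vector<StringRef> out;
  out.reserve(batch.data.size());
  for (std::size_t i = 0; i < batch.data.size(); ++i) {
    // Offset of string i inside the source blob; the same offset is valid
    // in the copy because the bytes were copied verbatim.
    std::ptrdiff_t offset = batch.data[i] - batch.blob.data();
    out.push_back({pool->data() + offset, batch.length[i]});
  }
  return out;
}
```

The sketch trades N small allocations for a single one and keeps the strings adjacent in memory. Taking ownership of the buffer instead would skip even that one copy, provided the existing pointers stay valid after the move.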
> ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC
> 1.5.5.
> ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:
> [https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153]
> It uses dictionary encoding for storing the values. Impala could copy/move
> the dictionary as well.
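
For the dictionary-encoded case, a similar sketch is possible: copy the dictionary bytes once, then resolve each row through its index. The types below (MockDictionary, MockEncodedBatch, StringRef) and the offsets layout are assumptions made for illustration. They are not the actual EncodedStringVectorBatch definition, which should be checked against the linked Vector.hh.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical dictionary: entries stored back to back in one blob.
struct MockDictionary {
  std::vector<char> blob;        // bytes of all dictionary entries
  std::vector<int64_t> offsets;  // entry k spans [offsets[k], offsets[k + 1])
};

// Hypothetical dictionary-encoded batch: one dictionary index per row.
struct MockEncodedBatch {
  std::shared_ptr<MockDictionary> dictionary;
  std::vector<int64_t> index;    // row i holds dictionary entry index[i]
};

// Hypothetical non-owning string view (pointer + length).
struct StringRef {
  const char* ptr;
  int64_t len;
};

// Copies the dictionary blob into 'pool' once and returns one view per row.
// Rows that share a dictionary entry share the same bytes in 'pool'.
std::vector<StringRef> DecodeWithSharedDictionary(const MockEncodedBatch& batch,
                                                  std::vector<char>* pool) {
  const MockDictionary& dict = *batch.dictionary;
  pool->assign(dict.blob.begin(), dict.blob.end());
  std::vector<StringRef> out;
  out.reserve(batch.index.size());
  for (int64_t k : batch.index) {
    assert(k >= 0 && static_cast<std::size_t>(k) + 1 < dict.offsets.size());
    int64_t begin = dict.offsets[k];
    int64_t end = dict.offsets[k + 1];
    out.push_back({pool->data() + begin, end - begin});
  }
  return out;
}
```

With this layout the amount copied depends on the dictionary size rather than on the number of rows, which is where the saving would come from for columns with many repeated values.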