Zoltán Borók-Nagy created IMPALA-13090:
------------------------------------------
Summary: De-duplicate adjacent strings when serializing outbound
row batches
Key: IMPALA-13090
URL: https://issues.apache.org/jira/browse/IMPALA-13090
Project: IMPALA
Issue Type: Improvement
Reporter: Zoltán Borók-Nagy
RowBatch::Serialize() uses Tuple::DeepCopy() to serialize tuples. When the
tuple has string slots, it copies the strings after the fixed-length tuple
data. This means low-NDV string columns are copied many times unnecessarily.
For example, in the SCAN fragments we usually have only a few copies of each
low-NDV string because, e.g., the string slots point to Parquet dictionary
entries. But whenever we send tuples over the network, each string gets as
many copies as there are rows. This results in much higher memory consumption
than needed, and it affects all the fragments further up the chain.
In some cases this can really hurt, e.g. when we send the file paths of
Iceberg position delete records: they have a very low NDV, but there can be
many records.
In my experiments, de-duplicating adjacent string records has very little
overhead, but can still improve performance and memory consumption
significantly.
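The idea can be illustrated with a minimal, self-contained sketch. This is not
Impala code: SerializedSlot, SerializedBatch and SerializeDedupAdjacent are
hypothetical names, and the batch is reduced to a single string column. It
only assumes that a serialized string slot stores an offset and a length into
a shared variable-length buffer, so adjacent equal strings can share one copy.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified stand-in for a serialized string slot: the offset
// and length of the string inside the variable-length data buffer.
struct SerializedSlot {
  uint32_t offset;
  uint32_t len;
};

// Hypothetical stand-in for a serialized row batch with one string column.
struct SerializedBatch {
  std::vector<SerializedSlot> slots;  // one fixed-length entry per row
  std::string var_len_data;           // string bytes referenced by the slots
};

// Serializes 'rows', copying a string only when it differs from the string of
// the immediately preceding row. Only adjacent duplicates are detected, so no
// hash table or other per-batch lookup structure is needed.
SerializedBatch SerializeDedupAdjacent(
    const std::vector<std::string_view>& rows) {
  SerializedBatch out;
  out.slots.reserve(rows.size());
  std::string_view prev;          // string of the previous row
  SerializedSlot prev_slot{0, 0};  // where the previous string was written
  bool have_prev = false;
  for (std::string_view s : rows) {
    // Cheap check first: same pointer and length (e.g. both rows point to
    // the same dictionary entry). Otherwise fall back to comparing bytes.
    bool same = have_prev && s.size() == prev.size() &&
                (s.data() == prev.data() || s == prev);
    if (!same) {
      // New value: append the bytes once and remember where they landed.
      prev_slot = {static_cast<uint32_t>(out.var_len_data.size()),
                   static_cast<uint32_t>(s.size())};
      out.var_len_data.append(s.data(), s.size());
      prev = s;
      have_prev = true;
    }
    // Duplicate rows reuse the offset of the copy that is already written.
    out.slots.push_back(prev_slot);
  }
  return out;
}
```

In this sketch a receiver would rebuild row i as
std::string_view(var_len_data.data() + slots[i].offset, slots[i].len), so
rows with the same file path end up pointing to the same bytes. Values that
repeat non-adjacently are still copied again, which keeps the per-row cost at
a single comparison.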