[jira] [Created] (IMPALA-13090) De-duplicate adjacent strings when serializing outbound row batches

Jira Thu, 16 May 2024 07:51:01 -0700

Zoltán Borók-Nagy created IMPALA-13090:
------------------------------------------


             Summary: De-duplicate adjacent strings when serializing outbound 
row batches
                 Key: IMPALA-13090
                 URL: https://issues.apache.org/jira/browse/IMPALA-13090
             Project: IMPALA
          Issue Type: Improvement
            Reporter: Zoltán Borók-Nagy


RowBatch::Serialize() uses Tuple::DeepCopy() to serialize tuples. When the 
tuple has string slots, it copies the strings after the fixed-length tuple data.
This means the values of low-NDV string columns are copied many times 
unnecessarily.

E.g. in the SCAN fragments we usually have only a few copies of low-NDV 
strings, because e.g. the slots point to Parquet dictionary entries. But 
whenever we send tuples over the network, each string is copied once per row. 
This results in much higher memory consumption than needed, and it affects all 
the fragments further up in the chain.

In some cases it can really hurt, e.g. when we send the file paths of Iceberg 
position delete records, as they have a very low NDV but there can be many 
records.

In my experiments, de-duplicating adjacent strings has very little overhead, 
but can still improve performance and memory consumption significantly.
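
A minimal C++ sketch of the idea, not the actual Impala code or patch. The 
names SerializedStringRef, SerializedColumn, SerializeStringColumn and GetRow 
are hypothetical, and a single string column stands in for a full tuple 
layout. When a row's string equals the previous row's string, the serializer 
reuses the previous offset instead of appending the bytes again.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a serialized string slot: an offset and a length
// into the batch's variable-length data buffer. Not Impala's StringValue.
struct SerializedStringRef {
  uint32_t offset;
  uint32_t len;
};

// Hypothetical serialized form of one string column.
struct SerializedColumn {
  std::vector<SerializedStringRef> refs;  // one entry per row
  std::string var_len_data;               // string payload bytes
};

// Serializes one string column. With dedup_adjacent set, a value equal to the
// value in the previous row reuses that row's bytes instead of copying again.
SerializedColumn SerializeStringColumn(
    const std::vector<std::string_view>& rows, bool dedup_adjacent) {
  SerializedColumn out;
  out.refs.reserve(rows.size());
  for (size_t i = 0; i < rows.size(); ++i) {
    const std::string_view cur = rows[i];
    if (dedup_adjacent && i > 0 && cur == rows[i - 1]) {
      // Same value as the previous row: point at the already-written bytes.
      out.refs.push_back(out.refs.back());
      continue;
    }
    // Offsets are 32-bit in this sketch, so the buffer must stay below 4 GiB.
    assert(out.var_len_data.size() + cur.size() <= UINT32_MAX);
    const SerializedStringRef ref{
        static_cast<uint32_t>(out.var_len_data.size()),
        static_cast<uint32_t>(cur.size())};
    out.var_len_data.append(cur.data(), cur.size());
    out.refs.push_back(ref);
  }
  return out;
}

// Reads back the string of a given row from the serialized column.
std::string_view GetRow(const SerializedColumn& col, size_t row) {
  const SerializedStringRef& ref = col.refs[row];
  return std::string_view(col.var_len_data).substr(ref.offset, ref.len);
}
```

Only adjacent rows are compared, so no hash table is needed and the extra cost 
per row is a single comparison. Non-adjacent repeats are still copied: for the 
five rows a, a, a, b, a the payload is written three times rather than five. 
The sketch compares by value; comparing the source pointer and length first 
would be a cheaper check when slots point to the same dictionary entry, but 
that is an assumption about the source data, not something this sketch relies 
on.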



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
