[jira] [Updated] (IMPALA-12479) Adaptive rowbatch serialization for duplicated STRING values

Jira Tue, 03 Oct 2023 08:15:05 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zoltán Borók-Nagy updated IMPALA-12479:
---------------------------------------
    Description: 
When we are serializing row batches, or putting them into BufferedTupleStreams, 
we always deep copy every tuple into a flat structure. This means we will write 
the fixed length fields first, then all the varlen fields are coming.

For low-NDV strings this means we will make a log of copies of the same string 
values.

There is de-duplication in RowBatch::serialize(), but it only applies to whole 
tuples, not slots:

[https://github.com/apache/impala/blob/ae14d78c8f2e8366e67aa8c39f0c02c60862905e/be/src/runtime/row-batch.cc#L234]

We could try to implement adaptive de-duplication of adjacent STRING slots. We 
would sample the first N (e.g. 10000) tuples' string slots, and check what is 
the ratio of the adjacent identical string values. Memory overhead should be 
negligbible as we would use a single counter per string slot.

If we find that the ratio is quite high, e.g. >0.5, then we would de-duplicate 
the affected string slots. I.e. adjacent identical string values would use the 
previous string value's pointer.

This optimization could improve compression times a lot (because we would need 
to compress much smaller buffers), and memory consumption of non-SCAN fragments 
(which always work on duplicated string data).

This could be very beneficial for any low-NDV string column, and extremely 
useful for Iceberg position delete records, where we have a lot of adjacent 
duplicated long strings. Sorted input based on string values would also benefit 
greatly.

  was:
When we are serializing row batches, or putting them into BufferedTupleStreams, 
we always deep copy every tuple into a flat structure. This means we will write 
the fixed length fields first, then all the varlen fields are coming.

For low-NDV strings this means we will make a log of copies of the same string 
values.

There is de-duplication in RowBatch::serialize(), but it only applies to whole 
tuples, not slots:

[https://github.com/apache/impala/blob/ae14d78c8f2e8366e67aa8c39f0c02c60862905e/be/src/runtime/row-batch.cc#L234]

We could try to implement adaptive de-duplication of adjacent STRING slots. We 
would sample the first N (e.g. 10000) tuples' string slots, and check what is 
the ratio of the adjacent identical string values. Memory overhead should be 
negligbible as we would use a single counter per string slot.

If we find that the ratio is quite high, e.g. >0.5, then we would de-duplicate 
the affected string slots. I.e. adjacent identical string values would use the 
previous string value's pointer.

This optimization could improve compression times a lot (because we would need 
to compress much smaller buffers), and memory consumption of non-SCAN fragments 
(which always work on duplicated string data).

This could be very beneficial for any low-NDV string column, and extremely 
useful for Iceberg position delete records, where we have a lot of adjacent 
duplicated long strings.


> Adaptive rowbatch serialization for duplicated STRING values
> ------------------------------------------------------------
>
>                 Key: IMPALA-12479
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12479
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: performance
>
> When we are serializing row batches, or putting them into 
> BufferedTupleStreams, we always deep copy every tuple into a flat structure. 
> This means we will write the fixed length fields first, then all the varlen 
> fields are coming.
> For low-NDV strings this means we will make a log of copies of the same 
> string values.
> There is de-duplication in RowBatch::serialize(), but it only applies to 
> whole tuples, not slots:
> [https://github.com/apache/impala/blob/ae14d78c8f2e8366e67aa8c39f0c02c60862905e/be/src/runtime/row-batch.cc#L234]
> We could try to implement adaptive de-duplication of adjacent STRING slots. 
> We would sample the first N (e.g. 10000) tuples' string slots, and check what 
> is the ratio of the adjacent identical string values. Memory overhead should 
> be negligbible as we would use a single counter per string slot.
> If we find that the ratio is quite high, e.g. >0.5, then we would 
> de-duplicate the affected string slots. I.e. adjacent identical string values 
> would use the previous string value's pointer.
> This optimization could improve compression times a lot (because we would 
> need to compress much smaller buffers), and memory consumption of non-SCAN 
> fragments (which always work on duplicated string data).
> This could be very beneficial for any low-NDV string column, and extremely 
> useful for Iceberg position delete records, where we have a lot of adjacent 
> duplicated long strings. Sorted input based on string values would also 
> benefit greatly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (IMPALA-12479) Adaptive rowbatch serialization for duplicated STRING values

Reply via email to