[ https://issues.apache.org/jira/browse/IMPALA-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021876#comment-18021876 ]
Zoltán Borók-Nagy commented on IMPALA-12479: -------------------------------------------- For Iceberg delete records we implemented IMPALA-13194. > Adaptive rowbatch serialization for duplicated STRING values > ------------------------------------------------------------ > > Key: IMPALA-12479 > URL: https://issues.apache.org/jira/browse/IMPALA-12479 > Project: IMPALA > Issue Type: Bug > Components: Backend > Reporter: Zoltán Borók-Nagy > Assignee: Daniel Becker > Priority: Major > Labels: impala-iceberg, performance > > When we are serializing row batches, or putting them into > BufferedTupleStreams, we always deep copy every tuple into a flat structure. > This means we will write the fixed length fields first, then all the varlen > fields are coming. > For low-NDV strings this means we will make a log of copies of the same > string values. > There is de-duplication in RowBatch::serialize(), but it only applies to > whole tuples, not slots: > [https://github.com/apache/impala/blob/ae14d78c8f2e8366e67aa8c39f0c02c60862905e/be/src/runtime/row-batch.cc#L234] > We could try to implement adaptive de-duplication of adjacent STRING slots. > We would sample the first N (e.g. 10000) tuples' string slots, and check what > is the ratio of the adjacent identical string values. Memory overhead should > be negligbible as we would use a single counter per string slot. > If we find that the ratio is quite high, e.g. >0.5, then we would > de-duplicate the affected string slots. I.e. adjacent identical string values > would use the previous string value's pointer. > This optimization could improve compression times a lot (because we would > need to compress much smaller buffers), and memory consumption of non-SCAN > fragments (which always work on duplicated string data). > This could be very beneficial for any low-NDV string column, and extremely > useful for Iceberg position delete records, where we have a lot of adjacent > duplicated long strings. Sorted input based on string values would also > benefit greatly. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org