[
https://issues.apache.org/jira/browse/IMPALA-13194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltán Borók-Nagy reassigned IMPALA-13194:
------------------------------------------
Assignee: Zoltán Borók-Nagy
> Fast-serialize position delete records
> --------------------------------------
>
> Key: IMPALA-13194
> URL: https://issues.apache.org/jira/browse/IMPALA-13194
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Zoltán Borók-Nagy
> Assignee: Zoltán Borók-Nagy
> Priority: Major
> Labels: impala-iceberg
>
> Currently the serialization of position delete records is very wasteful. The
> records contain the slots 'file_path' and 'pos'. During serialization we do
> the following:
> # Write a fixed-size tuple that has a StringValue and a BigInt slot (20 bytes
> in total).
> # Copy the StringValue's contents after the tuple.
> # Convert the StringValue slot to an offset to the string data.
> So we end up having something like this:
> {noformat}
> +-------------+--------+----------------+-------------+--------+----------------+-----+
> | StringValue | BigInt | File path      | StringValue | BigInt | File path      | ... |
> +-------------+--------+----------------+-------------+--------+----------------+-----+
> | ptr, len    | 42     | /.../a.parquet | ptr, len    | 43     | /.../a.parquet | ... |
> +-------------+--------+----------------+-------------+--------+----------------+-----+
> {noformat}
> Storing the file paths this way is very redundant, and we end up with a
> huge buffer that we need to compress and send over the network.
> Moreover, we copy the file paths in memory twice:
> # From input row batch to the KrpcDataStreamSender::Channel's temporary row
> batch
> # From the temporary row batch to the outbound row batch (during
> serialization)
> The position delete files store the delete records in ascending order. This
> means adjacent records mostly have the same file path. So we could just
> buffer the position delete records up to the Channel's capacity, then
> serialize the data in a more efficient way.
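
The issue does not specify what the more efficient format would look like. As a rough illustration only, the sketch below compares the per-record layout described above with one possible grouped layout that writes each file path once, followed by all of its positions. It is written in Python for brevity and is not Impala code: the function names, the grouped layout (4-byte path length, 4-byte record count, path bytes, then 8-byte positions), and the sample paths are all assumptions made for this example.

```python
import struct
from itertools import groupby


def serialize_naive(records):
    """Per-record layout: 20-byte fixed part, then the path bytes (repeated every time)."""
    out = bytearray()
    for path, pos in records:
        data = path.encode("utf-8")
        # 8-byte offset standing in for the StringValue pointer, 4-byte length,
        # 8-byte position: 20 bytes in total, with no padding.
        out += struct.pack("<QIq", len(out) + 20, len(data), pos)
        out += data
    return bytes(out)


def serialize_grouped(records):
    """Hypothetical grouped layout: each run of equal paths is written once."""
    out = bytearray()
    # groupby only merges adjacent equal keys, which matches the observation
    # that adjacent position delete records mostly share a file path.
    for path, group in groupby(records, key=lambda r: r[0]):
        positions = [pos for _, pos in group]
        data = path.encode("utf-8")
        out += struct.pack("<II", len(data), len(positions))
        out += data
        out += struct.pack(f"<{len(positions)}q", *positions)
    return bytes(out)


def deserialize_grouped(buf):
    """Inverse of serialize_grouped: expands the groups back into (path, pos) records."""
    records = []
    offset = 0
    while offset < len(buf):
        path_len, count = struct.unpack_from("<II", buf, offset)
        offset += 8
        path = buf[offset:offset + path_len].decode("utf-8")
        offset += path_len
        positions = struct.unpack_from(f"<{count}q", buf, offset)
        offset += 8 * count
        records.extend((path, pos) for pos in positions)
    return records
```

With made-up sample input such as `[("/data/a.parquet", 42), ("/data/a.parquet", 43), ("/data/b.parquet", 7)]`, the grouped buffer is smaller than the per-record one because the first path is stored only once. The saving grows with the number of consecutive records that share a path.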