Zoltán Borók-Nagy created IMPALA-13194:
------------------------------------------

             Summary: Fast-serialize position delete records
                 Key: IMPALA-13194
                 URL: https://issues.apache.org/jira/browse/IMPALA-13194
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Zoltán Borók-Nagy


Currently the serialization of position delete records is very wasteful. The 
records contain the slots 'file_path' and 'pos'. During serialization we do 
the following:
 # Write a fixed-size tuple that has a StringValue and a BigInt slot (20 bytes 
in total)
 # Copy the StringValue's contents after the tuple
 # Convert the StringValue slot to be an offset to the string data

So we end up having something like this:
{noformat}
+-------------+--------+----------------+-------------+--------+----------------+-----+
| StringValue | BigInt |   File path    | StringValue | BigInt |   File path    | ... |
+-------------+--------+----------------+-------------+--------+----------------+-----+
| ptr, len    |     42 | /.../a.parquet | ptr, len    |     43 | /.../a.parquet | ... |
+-------------+--------+----------------+-------------+--------+----------------+-----+
{noformat}
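The layout above can be reproduced with a minimal C++ sketch. It is not the 
Impala code: PositionDelete and SerializePerRecord are hypothetical names, and 
the split of the 20-byte tuple into an 8-byte pointer/offset, a 4-byte length 
and an 8-byte BigInt is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for one position delete record.
struct PositionDelete {
  std::string file_path;
  int64_t pos;
};

// Assumed slot sizes: StringValue = 8-byte pointer/offset + 4-byte length,
// BigInt = 8 bytes, which gives the 20-byte fixed-size tuple.
constexpr size_t kStringValueSize = 12;
constexpr size_t kBigIntSize = 8;
constexpr size_t kFixedTupleSize = kStringValueSize + kBigIntSize;

// Writes [offset, len][pos][path bytes] for every record, so the full file
// path is repeated once per record.
std::vector<uint8_t> SerializePerRecord(
    const std::vector<PositionDelete>& records) {
  std::vector<uint8_t> out;
  for (const PositionDelete& rec : records) {
    assert(rec.file_path.size() <= static_cast<size_t>(INT32_MAX));
    const size_t tuple_start = out.size();
    out.resize(tuple_start + kFixedTupleSize + rec.file_path.size());
    uint8_t* dst = out.data() + tuple_start;
    // The pointer slot is stored as an offset to the string data.
    const uint64_t offset = tuple_start + kFixedTupleSize;
    const int32_t len = static_cast<int32_t>(rec.file_path.size());
    std::memcpy(dst, &offset, sizeof(offset));
    std::memcpy(dst + sizeof(offset), &len, sizeof(len));
    std::memcpy(dst + kStringValueSize, &rec.pos, sizeof(rec.pos));
    // The string contents are copied right after the fixed-size tuple.
    std::memcpy(dst + kFixedTupleSize, rec.file_path.data(),
                rec.file_path.size());
  }
  return out;
}
```

With this layout every record costs 20 bytes plus the full length of its file 
path, even when the path is identical to the one in the previous record.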
Storing the file paths this way is very redundant, and in the end we have a 
huge buffer that we need to compress and send over the network. 
Moreover, we copy the file paths in memory twice:
 # From input row batch to the KrpcDataStreamSender::Channel's temporary row 
batch
 # From the temporary row batch to the outbound row batch (during serialization)

The position delete files store the delete records in ascending order (by file 
path, then by position). This means adjacent records mostly have the same file 
path. So we could just buffer the position delete records up to the Channel's 
capacity, then serialize the data in a more efficient way.
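
One possible shape of such a serialization is sketched below in C++. The issue 
does not specify a wire format, so everything here is an assumption: 
PositionDelete, PathRun, GroupAdjacentByPath and EncodedSize are hypothetical 
names, and the encoding (each distinct path written once per run, followed by 
its positions) is only one way to exploit the ordering.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one buffered position delete record.
struct PositionDelete {
  std::string file_path;
  int64_t pos;
};

// One run of adjacent records that share the same file path.
struct PathRun {
  std::string file_path;
  std::vector<int64_t> positions;
};

// Groups adjacent records with equal file paths. Because the input is sorted,
// each data file normally ends up in a single run.
std::vector<PathRun> GroupAdjacentByPath(
    const std::vector<PositionDelete>& records) {
  std::vector<PathRun> runs;
  for (const PositionDelete& rec : records) {
    if (runs.empty() || runs.back().file_path != rec.file_path) {
      runs.push_back(PathRun{rec.file_path, {}});
    }
    runs.back().positions.push_back(rec.pos);
  }
  return runs;
}

// Size of an assumed encoding: per run a 4-byte path length, the path bytes,
// a 4-byte position count, then 8 bytes per position.
size_t EncodedSize(const std::vector<PathRun>& runs) {
  size_t total = 0;
  for (const PathRun& run : runs) {
    total += 4 + run.file_path.size() + 4 + 8 * run.positions.size();
  }
  return total;
}
```

Under these assumptions the per-record cost drops from 20 bytes plus the path 
length to roughly 8 bytes, with each file path paid for once per run instead of 
once per record.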



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
