Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/21563 )

Change subject: IMPALA-13194: Fast-serialize position delete records
......................................................................

IMPALA-13194: Fast-serialize position delete records

Currently the serialization of position delete records is very
wasteful. The records contain the slots 'file_path' and 'pos'. During
serialization we do the following:

1. Write fixed-size tuples that have a StringValue and a BigInt slot
2. Copy the StringValue's contents after the tuple
3. Convert the StringValue ptr to be an offset to the string data

So we end up having something like this:
+-------------+--------+----------------+-------------+--------+----------------+-----+
| StringValue | BigInt | File path      | StringValue | BigInt | File path      | ... |
+-------------+--------+----------------+-------------+--------+----------------+-----+
| ptr, len    | 42     | /.../a.parquet | ptr, len    | 43     | /.../a.parquet | ... |
+-------------+--------+----------------+-------------+--------+----------------+-----+

Storing the file paths this way is very redundant, and in the end we
have a huge buffer that we need to compress and send over the network.
Moreover, we copy the file paths in memory twice:

1. From the input row batch to the KrpcDataStreamSender::Channel's
   temporary row batch
2. From the temporary row batch to the outbound row batch (during
   serialization)

The position delete files store the delete records in ascending order.
This means adjacent records mostly have the same file path. So we can
buffer the position delete records up to the Channel's capacity, then
serialize the data in a more efficient way.

With this patch, serialized data will look like this:
+----------------+-------------+--------+-------------+--------+-----+
| File path      | StringValue | BigInt | StringValue | BigInt | ... |
+----------------+-------------+--------+-------------+--------+-----+
| /.../a.parquet | ptr, len    | 42     | ptr, len    | 43     | ... |
+----------------+-------------+--------+-------------+--------+-----+

That is, a file path is followed by the tuples that share that file
path, then comes the next file path with its tuples, and so on.

Measurements:
07:EXCHANGE         : 1m   ==> 52s
F02:EXCHANGE SENDER : 1m2s ==> 16s

Change-Id: I6095f318e3d06dedb4197681156b40dd2a326c6f
Reviewed-on: http://gerrit.cloudera.org:8080/21563
Reviewed-by: Csaba Ringhofer <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
---
M be/src/exec/iceberg-delete-builder.h
M be/src/runtime/CMakeLists.txt
A be/src/runtime/iceberg-position-delete-collector.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/krpc-data-stream-sender.h
A be/src/runtime/outbound-row-batch.cc
M be/src/runtime/outbound-row-batch.h
M be/src/runtime/row-batch.cc
M be/src/runtime/string-value.h
9 files changed, 395 insertions(+), 68 deletions(-)

Approvals:
  Csaba Ringhofer: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/21563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I6095f318e3d06dedb4197681156b40dd2a326c6f
Gerrit-Change-Number: 21563
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
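
For illustration only, the two layouts described in the commit message can be compared with a small standalone sketch. This is not the code from the patch: PositionDelete, SerializeNaive, SerializeGrouped and the 20-byte tuple encoding (8-byte offset, 4-byte length, 8-byte position) are hypothetical stand-ins for Impala's tuples, StringValue and BigInt slots, chosen only to make the size difference visible.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical record type; the real code operates on Impala tuples.
struct PositionDelete {
  std::string file_path;
  int64_t pos;
};

// Assumed on-wire tuple size: 8-byte offset + 4-byte length + 8-byte pos.
constexpr size_t kTupleSize = 8 + 4 + 8;

// Appends the raw bytes of a trivially copyable value to the buffer.
template <typename T>
void Append(std::vector<uint8_t>* buf, const T& v) {
  const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
  buf->insert(buf->end(), p, p + sizeof(T));
}

// Reads a trivially copyable value back from a byte offset in the buffer.
template <typename T>
T ReadAt(const std::vector<uint8_t>& buf, size_t off) {
  T v;
  std::memcpy(&v, buf.data() + off, sizeof(T));
  return v;
}

// Writes one tuple whose string slot is an (offset, len) pair into the buffer.
void AppendTuple(std::vector<uint8_t>* buf, uint64_t off, uint32_t len,
                 int64_t pos) {
  Append(buf, off);
  Append(buf, len);
  Append(buf, pos);
}

// Old layout: every tuple is followed by its own copy of the file path.
std::vector<uint8_t> SerializeNaive(const std::vector<PositionDelete>& recs) {
  std::vector<uint8_t> buf;
  for (const auto& r : recs) {
    uint64_t off = buf.size() + kTupleSize;  // string data follows the tuple
    AppendTuple(&buf, off, static_cast<uint32_t>(r.file_path.size()), r.pos);
    buf.insert(buf.end(), r.file_path.begin(), r.file_path.end());
  }
  return buf;
}

// New layout: a file path is written once, followed by the tuples of all
// consecutive records that share it; each tuple points back at that copy.
std::vector<uint8_t> SerializeGrouped(const std::vector<PositionDelete>& recs) {
  std::vector<uint8_t> buf;
  const std::string* cur = nullptr;  // path of the current run, if any
  uint64_t cur_off = 0;              // where that path starts in buf
  for (const auto& r : recs) {
    if (cur == nullptr || *cur != r.file_path) {
      cur_off = buf.size();
      buf.insert(buf.end(), r.file_path.begin(), r.file_path.end());
      cur = &r.file_path;
    }
    AppendTuple(&buf, cur_off, static_cast<uint32_t>(r.file_path.size()), r.pos);
  }
  return buf;
}
```

With three records, two of which share a 15-byte path, the per-tuple layout takes 3 * (20 + 15) = 105 bytes in this sketch, while the grouped layout takes 2 * 15 + 3 * 20 = 90 bytes. The saving grows with the length of each run of identical paths, which is why the sorted order of position delete files matters.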
