klion26 commented on PR #8031:
URL: https://github.com/apache/arrow-rs/pull/8031#issuecomment-3139775558
@alamb @scovich
I've benchmarked three implementations:
1. the current implementation with `PackedU32Iterator`
2. the implementation that calls `append_packed_u32` into a tmp buf, then splices it into the parent's buffer like we did for `ListBuilder::finish`
<details>
<summary>append_packed_u32 into a tmp buf</summary>

```rust
let num_fields_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte
let header_size = 1 + // header byte (i.e., `object_header`)
    num_fields_size + // num_fields
    (num_fields * id_size as usize) + // field IDs
    ((num_fields + 1) * offset_size as usize); // field offsets + data_size
let mut bytes_to_splice = Vec::with_capacity(header_size + 3);
// Write header byte
let header = object_header(is_large, id_size, offset_size);
bytes_to_splice.push(header);
append_packed_u32(&mut bytes_to_splice, num_fields as u32, num_fields_size);
for field_id in self.fields.keys() {
    append_packed_u32(&mut bytes_to_splice, *field_id, id_size as usize);
}
for offset in self.fields.values() {
    // Note: offsets are packed with `offset_size`, not `id_size`.
    append_packed_u32(&mut bytes_to_splice, *offset as u32, offset_size as usize);
}
append_packed_u32(&mut bytes_to_splice, data_size as u32, offset_size as usize);
let starting_offset = self.parent_value_offset_base;
// Shift existing data to make room for the header
let buffer = parent_buffer.inner_mut();
buffer.splice(starting_offset..starting_offset, bytes_to_splice);
```
</details>
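For readers outside the PR, here is a minimal stand-in for `append_packed_u32` (the real helper lives in arrow-rs; this sketch only assumes the behavior relied on above: append the `nbytes` low-order little-endian bytes of a `u32`):

```rust
// Hypothetical stand-in for arrow-rs's `append_packed_u32`:
// append the low-order `nbytes` little-endian bytes of `value`.
fn append_packed_u32(buf: &mut Vec<u8>, value: u32, nbytes: usize) {
    buf.extend_from_slice(&value.to_le_bytes()[..nbytes]);
}

fn main() {
    let mut buf = Vec::new();
    append_packed_u32(&mut buf, 0x0102_0304, 2); // keep only the two low bytes
    append_packed_u32(&mut buf, 7, 1);
    assert_eq!(buf, vec![0x04, 0x03, 0x07]);
}
```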
3. the implementation that uses `PackedU32Iterator` with extend, then splices the result into the parent's buffer
<details>
<summary> PackedU32Iterator with extend </summary>

```rust
let num_fields_bytes = num_fields.to_le_bytes();
let num_elements_bytes = num_fields_bytes.iter().take(num_fields_size).copied();
let fields = PackedU32Iterator::new(
    id_size as usize,
    self.fields.keys().map(|field_id| field_id.to_le_bytes()),
);
let offsets = PackedU32Iterator::new(
    offset_size as usize,
    self.fields
        .values()
        .map(|offset| (*offset as u32).to_le_bytes()),
);
let data_size_bytes = (data_size as u32).to_le_bytes();
let data_size_bytes_iter = data_size_bytes.iter().take(offset_size as usize).copied();
let header = object_header(is_large, id_size, offset_size);
let mut bytes_to_splice = vec![header];
bytes_to_splice.extend(num_elements_bytes);
bytes_to_splice.extend(fields);
bytes_to_splice.extend(offsets);
bytes_to_splice.extend(data_size_bytes_iter);
let starting_offset = self.parent_value_offset_base;
// Shift existing data to make room for the header
let buffer = parent_buffer.inner_mut();
buffer.splice(starting_offset..starting_offset, bytes_to_splice);
```
</details>
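As a point of comparison, the packed-iterator idea can be approximated with plain std iterators. `PackedU32Iterator` itself is internal to this PR; the sketch below only assumes it yields the low-order `size` little-endian bytes of each value:

```rust
// Approximation of the packed-u32 iterator idea using std iterators:
// yield the low-order `size` little-endian bytes of each u32, lazily,
// with no temporary buffer per value.
fn packed_u32_bytes<'a>(values: &'a [u32], size: usize) -> impl Iterator<Item = u8> + 'a {
    values.iter().flat_map(move |v| {
        let bytes = v.to_le_bytes();
        (0..size).map(move |i| bytes[i])
    })
}

fn main() {
    // Pack two values into 2 bytes each.
    let packed: Vec<u8> = packed_u32_bytes(&[1, 0x0203], 2).collect();
    assert_eq!(packed, vec![0x01, 0x00, 0x03, 0x02]);
}
```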
The results show that the first implementation wins (but not by much).
## PackedU32Iterator
```
group                                                                7978_optimize_header_generatation_with_iterator    main
-----                                                                -----------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00      28.4±1.05ms ? ?/sec       1.03      29.1±0.97ms ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00     324.6±8.35ms ? ?/sec       1.01     327.6±8.18ms ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00      11.0±0.40ms ? ?/sec       1.01      11.2±0.34ms ? ?/sec
variant_get_primitive                                                1.00    1105.8±33.81µs ? ?/sec      1.01    1121.0±43.28µs ? ?/sec
```
## `append_packed_u32` into a tmp buf
```
group                                                                7978_optimize_header_generation_logic_object_builder    main
-----                                                                ----------------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00      28.5±1.09ms ? ?/sec       1.02      29.1±0.97ms ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00     328.9±8.20ms ? ?/sec       1.00     327.6±8.18ms ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00      11.3±0.31ms ? ?/sec       1.00      11.2±0.34ms ? ?/sec
variant_get_primitive                                                1.00    1106.9±40.65µs ? ?/sec      1.01    1121.0±43.28µs ? ?/sec
```
## `PackedU32Iterator` with extend
```
group                                                                7978_optimizer_header_generation_with_extend    main
-----                                                                --------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.04      30.2±1.82ms ? ?/sec       1.00      29.1±0.97ms ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.06     345.8±9.52ms ? ?/sec       1.00     327.6±8.18ms ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.12      12.5±0.77ms ? ?/sec       1.00      11.2±0.34ms ? ?/sec
variant_get_primitive                                                1.00    1123.7±50.11µs ? ?/sec      1.00    1121.0±43.28µs ? ?/sec
```
From the conversations in previous PRs [[1](https://github.com/apache/arrow-rs/pull/7987#pullrequestreview-3057148493)][[2](https://github.com/apache/arrow-rs/pull/7987#issuecomment-3121763470)], I think we prefer:
1. using an iterator (or a tmp buf) with splice, so that we cannot miscalculate the size spliced into the parent's buffer
2. avoiding tmp buf creation
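On point 1: `Vec::splice` over an empty range inserts exactly what the iterator yields and shifts the tail, so there is no separate size to compute (and therefore no size to get wrong). A minimal illustration with hypothetical header bytes:

```rust
fn main() {
    // Parent buffer already holds the child's value bytes starting at offset 3.
    let mut parent: Vec<u8> = vec![0xAA, 0xBB, 0xCC, 1, 2, 3];
    let header = vec![0xF0, 0x02]; // hypothetical header bytes
    // Splicing over the empty range 3..3 inserts exactly `header.len()` bytes
    // and shifts the existing data right; no manual size bookkeeping needed.
    parent.splice(3..3, header);
    assert_eq!(parent, vec![0xAA, 0xBB, 0xCC, 0xF0, 0x02, 1, 2, 3]);
}
```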
Given this and the benchmarks, I have some questions:
1. For the "miscalculated header" problem, can we use unit tests to cover the header-size calculation logic?
2. For the "tmp buf creation" problem, the PR in #7897 suggests it can still perform better to use a tmp buf with `append_packed_u32`. Is that because the tmp-buffer creation cost may or may not be the bottleneck, depending on the workload? If so, do we need to add more benchmarks? (Also, do we need to benchmark different header metadata, such as `is_large` being true, for
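On question 1, here is a sketch of the kind of unit test that could pin down the header-size arithmetic. The names are hypothetical (the real builder lives in arrow-rs); the test simply checks that the predicted size matches the number of bytes actually written:

```rust
// Hypothetical test: the predicted header size must equal the number of
// bytes actually written. `id_size`/`offset_size` are bytes per packed u32.
fn predicted_header_size(num_fields: usize, is_large: bool, id_size: usize, offset_size: usize) -> usize {
    let num_fields_size = if is_large { 4 } else { 1 };
    1 + num_fields_size + num_fields * id_size + (num_fields + 1) * offset_size
}

fn main() {
    let (num_fields, is_large, id_size, offset_size) = (3usize, false, 1usize, 2usize);
    // Write a dummy header the same way the tmp-buf variant lays it out.
    let mut buf = vec![0u8]; // header byte
    buf.extend(std::iter::repeat(0u8).take(if is_large { 4 } else { 1 })); // num_fields
    for _ in 0..num_fields {
        buf.extend(std::iter::repeat(0u8).take(id_size)); // field IDs
    }
    for _ in 0..=num_fields {
        buf.extend(std::iter::repeat(0u8).take(offset_size)); // field offsets + data_size
    }
    assert_eq!(buf.len(), predicted_header_size(num_fields, is_large, id_size, offset_size));
}
```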