klion26 commented on PR #8031: URL: https://github.com/apache/arrow-rs/pull/8031#issuecomment-3139775558
@alamb @scovich I've tried to benchmark three implementations:

1. the current implementation with `PackedU32Iterator`
2. the implementation with `append_packed_u32` into a tmp buf, then splicing it into the parent's buffer like we did for `ListBuilder::finish`

<details>
<summary>append_packed_u32 into a tmp buf</summary>

```rust
let num_fields_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte
let header_size = 1 + // header byte (i.e., `object_header`)
    num_fields_size +
    (num_fields * id_size as usize) + // field IDs
    ((num_fields + 1) * offset_size as usize); // field offsets + data_size

let mut bytes_to_splice = Vec::with_capacity(header_size + 3);

// Write header byte
let header = object_header(is_large, id_size, offset_size);
bytes_to_splice.push(header);

append_packed_u32(&mut bytes_to_splice, num_fields as u32, num_fields_size);

for field_id in self.fields.keys() {
    append_packed_u32(&mut bytes_to_splice, *field_id, id_size as usize);
}

for offset in self.fields.values() {
    append_packed_u32(&mut bytes_to_splice, *offset as u32, offset_size as usize);
}

append_packed_u32(&mut bytes_to_splice, data_size as u32, offset_size as usize);

let starting_offset = self.parent_value_offset_base;

// Shift existing data to make room for the header
let buffer = parent_buffer.inner_mut();
buffer.splice(starting_offset..starting_offset, bytes_to_splice);
```

</details>

3.
   the implementation using `PackedU32Iterator` and `extend`, then splicing the result into the parent's buffer

<details>
<summary>PackedU32Iterator with extend</summary>

```rust
let num_fields_bytes = num_fields.to_le_bytes();
let num_elements_bytes = num_fields_bytes.iter().take(num_fields_size).copied();

let fields = PackedU32Iterator::new(
    id_size as usize,
    self.fields.keys().map(|field_id| field_id.to_le_bytes()),
);
let offsets = PackedU32Iterator::new(
    offset_size as usize,
    self.fields
        .values()
        .map(|offset| (*offset as u32).to_le_bytes()),
);
let data_size_bytes = (data_size as u32).to_le_bytes();
let data_size_bytes_iter = data_size_bytes.iter().take(offset_size as usize).copied();

let header = object_header(is_large, id_size, offset_size);

let mut bytes_to_splice = vec![header];
bytes_to_splice.extend(num_elements_bytes);
bytes_to_splice.extend(fields);
bytes_to_splice.extend(offsets);
bytes_to_splice.extend(data_size_bytes_iter);

let starting_offset = self.parent_value_offset_base;

// Shift existing data to make room for the header
let buffer = parent_buffer.inner_mut();
buffer.splice(starting_offset..starting_offset, bytes_to_splice);
```

</details>

The results show that the first implementation wins (but not by much).

## PackedU32Iterator

```
group                                                                7978_optimize_header_generatation_with_iterator    main
-----                                                                -----------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00     28.4±1.05ms ? ?/sec                       1.03     29.1±0.97ms ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    324.6±8.35ms ? ?/sec                       1.01    327.6±8.18ms ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     11.0±0.40ms ? ?/sec                       1.01     11.2±0.34ms ? ?/sec
variant_get_primitive                                                1.00  1105.8±33.81µs ? ?/sec                       1.01  1121.0±43.28µs ? ?/sec
```

## `append_packed_u32` into a tmp buf

```
group                                                                7978_optimize_header_generation_logic_object_builder    main
-----                                                                ----------------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00     28.5±1.09ms ? ?/sec                            1.02     29.1±0.97ms ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    328.9±8.20ms ? ?/sec                            1.00    327.6±8.18ms ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     11.3±0.31ms ? ?/sec                            1.00     11.2±0.34ms ? ?/sec
variant_get_primitive                                                1.00  1106.9±40.65µs ? ?/sec                            1.01  1121.0±43.28µs ? ?/sec
```

## `PackedU32Iterator` with extend

```
group                                                                7978_optimizer_header_generation_with_extend    main
-----                                                                --------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.04     30.2±1.82ms ? ?/sec                    1.00     29.1±0.97ms ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.06    345.8±9.52ms ? ?/sec                    1.00    327.6±8.18ms ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.12     12.5±0.77ms ? ?/sec                    1.00     11.2±0.34ms ? ?/sec
variant_get_primitive                                                1.00  1123.7±50.11µs ? ?/sec                    1.00  1121.0±43.28µs ? ?/sec
```

From the conversation on previous PRs [[1](https://github.com/apache/arrow-rs/pull/7987#pullrequestreview-3057148493)][[2](https://github.com/apache/arrow-rs/pull/7987#issuecomment-3121763470)], I think we prefer:

1. using an iterator (or a tmp buf) with `splice`, so that we won't mis-calculate the size we splice into the parent's buffer
2. avoiding tmp buf creation

Given this and the benchmarks, I have some questions:

1. for the `mis-calculated header` problem, can we use unit tests to cover the header size calculation logic?
2. for the `tmp buf creation` problem, judging from the PR in #7897, it would have better performance if we used a tmp buf with `append_packed_u32`. Is the explanation that the tmp buffer creation cost may or may not be the bottleneck, depending on the workload we're on?
If this is the case, do we need to add some more benchmarks? (Also, do we need to bench different header metadata, such as when `is_large` is true?)