klion26 commented on PR #8031: URL: https://github.com/apache/arrow-rs/pull/8031#issuecomment-3139775558
@alamb @scovich I've tried to benchmark three implementations:

1. the current implementation with `PackedU32Iterator`
2. the implementation with `append_packed_u32` into a tmp buf, then splicing it into the parent's buffer like we did for `ListBuilder::finish`

<details>
<summary>append_packed_u32 into a tmp buf</summary>

```rust
let num_fields_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte
let header_size = 1 + // header byte (i.e., `object_header`)
    num_fields_size +
    (num_fields * id_size as usize) + // field IDs
    ((num_fields + 1) * offset_size as usize); // field offsets + data_size

let mut bytes_to_splice = Vec::with_capacity(header_size + 3);

// Write header byte
let header = object_header(is_large, id_size, offset_size);
bytes_to_splice.push(header);

append_packed_u32(&mut bytes_to_splice, num_fields as u32, num_fields_size);

for field_id in self.fields.keys() {
    append_packed_u32(&mut bytes_to_splice, *field_id, id_size as usize);
}

for offset in self.fields.values() {
    append_packed_u32(&mut bytes_to_splice, *offset as u32, offset_size as usize);
}

append_packed_u32(&mut bytes_to_splice, data_size as u32, offset_size as usize);

let starting_offset = self.parent_value_offset_base;

// Shift existing data to make room for the header
let buffer = parent_buffer.inner_mut();
buffer.splice(starting_offset..starting_offset, bytes_to_splice);
```

</details>

3.
   the implementation using `PackedU32Iterator` and `extend`, then splicing the result into the parent's buffer

<details>
<summary>PackedU32Iterator with extend</summary>

```rust
let num_fields_bytes = num_fields.to_le_bytes();
let num_elements_bytes = num_fields_bytes.iter().take(num_fields_size).copied();

let fields = PackedU32Iterator::new(
    id_size as usize,
    self.fields.keys().map(|field_id| field_id.to_le_bytes()),
);
let offsets = PackedU32Iterator::new(
    offset_size as usize,
    self.fields
        .values()
        .map(|offset| (*offset as u32).to_le_bytes()),
);
let data_size_bytes = (data_size as u32).to_le_bytes();
let data_size_bytes_iter = data_size_bytes.iter().take(offset_size as usize).copied();

let header = object_header(is_large, id_size, offset_size);

let mut bytes_to_splice = vec![header];
bytes_to_splice.extend(num_elements_bytes);
bytes_to_splice.extend(fields);
bytes_to_splice.extend(offsets);
bytes_to_splice.extend(data_size_bytes_iter);

let starting_offset = self.parent_value_offset_base;

// Shift existing data to make room for the header
let buffer = parent_buffer.inner_mut();
buffer.splice(starting_offset..starting_offset, bytes_to_splice);
```

</details>

The results show that the first implementation wins (but not by much).

## PackedU32Iterator

```
group                                                                7978_optimize_header_generatation_with_iterator    main
-----                                                                -----------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00     28.4±1.05ms ? ?/sec                       1.03     29.1±0.97ms ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    324.6±8.35ms ? ?/sec                       1.01    327.6±8.18ms ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     11.0±0.40ms ? ?/sec                       1.01     11.2±0.34ms ? ?/sec
variant_get_primitive                                                1.00  1105.8±33.81µs ? ?/sec                       1.01  1121.0±43.28µs ? ?/sec
```

## `append_packed_u32` into a tmp buf

```
group                                                                7978_optimize_header_generation_logic_object_builder    main
-----                                                                ----------------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.00     28.5±1.09ms ? ?/sec                            1.02     29.1±0.97ms ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    328.9±8.20ms ? ?/sec                            1.00    327.6±8.18ms ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00     11.3±0.31ms ? ?/sec                            1.00     11.2±0.34ms ? ?/sec
variant_get_primitive                                                1.00  1106.9±40.65µs ? ?/sec                            1.01  1121.0±43.28µs ? ?/sec
```

## `PackedU32Iterator` with extend

```
group                                                                7978_optimizer_header_generation_with_extend    main
-----                                                                --------------------------------------------    ----
batch_json_string_to_variant json_list 8k string                     1.04     30.2±1.82ms ? ?/sec                    1.00     29.1±0.97ms ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.06    345.8±9.52ms ? ?/sec                    1.00    327.6±8.18ms ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.12     12.5±0.77ms ? ?/sec                    1.00     11.2±0.34ms ? ?/sec
variant_get_primitive                                                1.00  1123.7±50.11µs ? ?/sec                    1.00  1121.0±43.28µs ? ?/sec
```

From the conversation on previous PRs [[1](https://github.com/apache/arrow-rs/pull/7987#pullrequestreview-3057148493)][[2](https://github.com/apache/arrow-rs/pull/7987#issuecomment-3121763470)], I think we prefer:

1. using an iterator (or a tmp buf) with `splice`, so that we won't mis-calculate the size we splice into the parent's buffer
2. avoiding tmp buf creation

Given this and the benchmarks, I have some questions:

1. for the `mis-calculated header` problem, can we use unit tests to cover the header size calculation logic?
2. for the `tmp buf creation` problem, judging from the PR in #7897, it would have better performance if we used a tmp buf with `append_packed_u32`. Is the explanation that the tmp buffer creation cost may or may not be the bottleneck, depending on the workload we're on?
If this is the case, do we need to add some more benchmarks? (Also, do we need to bench different header metadata, such as when `is_large` is true?)