jorgecarleitao opened a new pull request #8118:
URL: https://github.com/apache/arrow/pull/8118


   (This PR removes a lot of code ^_^)
   
   (This PR is built on top of #8117)
   
   The core problem that this PR addresses is the construction of a 
`StructArray`, whose spec can be found 
[here](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
   
   The current API to build a `StructArray` of 4 entries of fixed type is (part 
of a test):
   
   ```rust
   let string_builder = StringBuilder::new(4);
   let int_builder = Int32Builder::new(4);
   
   let mut fields = Vec::new();
   let mut field_builders = Vec::new();
   fields.push(Field::new("f1", DataType::Utf8, false));
   field_builders.push(Box::new(string_builder) as Box<ArrayBuilder>);
   fields.push(Field::new("f2", DataType::Int32, false));
   field_builders.push(Box::new(int_builder) as Box<ArrayBuilder>);
   
   let mut builder = StructBuilder::new(fields, field_builders);
   assert_eq!(2, builder.num_fields());
   
   let string_builder = builder
       .field_builder::<StringBuilder>(0)
       .expect("builder at field 0 should be string builder");
   string_builder.append_value("joe").unwrap();
   string_builder.append_null().unwrap();
   string_builder.append_null().unwrap();
   string_builder.append_value("mark").unwrap();
   
   let int_builder = builder
       .field_builder::<Int32Builder>(1)
       .expect("builder at field 1 should be int builder");
   int_builder.append_value(1).unwrap();
   int_builder.append_value(2).unwrap();
   int_builder.append_null().unwrap();
   int_builder.append_value(4).unwrap();
   
   builder.append(true).unwrap();
   builder.append(true).unwrap();
   builder.append_null().unwrap();
   builder.append(true).unwrap();
   
   let arr = builder.finish();
   ```
   
   This PR's proposal for the same array:
   
   ```rust
   let strings: ArrayRef = Arc::new(StringArray::from(vec![
       Some("joe"),
       None,
       None,
       Some("mark"),
   ]));
   let ints: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), Some(2), None, 
Some(4)]));
   
   let arr = StructArray::try_from(vec![("f1", strings.clone()), ("f2", 
ints.clone())]).unwrap();
   ```
   
   Note that:
   
   * There is no `Field`, only name: the attributes (type and nullability) are 
obtained from the `ArrayData`'s itself, and thus there a guarantee that the 
field's attributes are aligned with the Data.
   * The implementation is dynamically typed: the type is obtained from 
`Array::data_type`, instead of having to match Field's datatype to each field' 
builders
   * `Option` is used to specify whether the quantity is null or not
   
   The construction uses an OR on the entry's null bitmaps to decide whether 
the struct null bitmap is null at a given index. I.e. the third index of the 
example in [the 
spec](https://arrow.apache.org/docs/format/Columnar.html#struct-layout) is 
obtained by checking if all fields are null at that index.
   
   There is an edge case, that this constructor is unable to build (and the 
user needs to use the other `From`): a struct with a `0` at position X and all 
field's bitmap at position X to be `1`:
   
   ```
   # array of 1 entry:
   bitmap struct = [0]
   bitmap field1 = [1]
   bitmap field2 = [1]
   ```
   
   this is because, in this `TryFrom`, the bitmap of the struct is computed 
from a bitwise `or` of the field's entries.
   
   IMO this is a non-issue because a `null` in the struct already implies an 
`unspecified` value on every field and thus that field's value is already 
assumed to be undefined. However, this is important to mention as a round-trip 
with this case will fail: in the example above, `bitmap struct` will have a `1`.
   
   Finally, this has a performance improvement of 40%.
   
   <details>
     <summary>Benchmark results</summary>
   
   ```
   git checkout HEAD^ && cargo bench --bench array_from_vec -- 
struct_array_from_vec && git checkout no_builder1 && cargo bench --bench 
array_from_vec -- struct_array_from_vec
   ```
   
   ```
   struct_array_from_vec 128                                                    
                         
                           time:   [8.6816 us 8.8225 us 9.0221 us]
                           change: [-39.407% -38.457% -37.537%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     2 (2.00%) high mild
     3 (3.00%) high severe
   
   struct_array_from_vec 256                                                    
                         
                           time:   [11.218 us 11.448 us 11.707 us]
                           change: [-44.091% -42.210% -40.540%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   struct_array_from_vec 512                                                    
                         
                           time:   [16.402 us 17.445 us 18.638 us]
                           change: [-49.129% -46.775% -44.109%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 11 outliers among 100 measurements (11.00%)
     2 (2.00%) high mild
     9 (9.00%) high severe
   
   struct_array_from_vec 1024                                                   
                          
                           time:   [25.527 us 26.019 us 26.648 us]
                           change: [-46.410% -42.028% -37.256%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 10 outliers among 100 measurements (10.00%)
     5 (5.00%) high mild
     5 (5.00%) high severe
   ```
   </details>
   
   ## Final note:
   
   The general direction that I am heading with this is to minimize the usage 
of builders. My issue with builders is that they are statically typed and 
perform incremental changes, but almost all our operations are dynamically 
typed and in bulk: batch read, batch write, etc. As such, it is often faster 
(and much simpler from UX's perspective) to create a `Vec<Option<_>>` and use 
it to create an Arrow Array.
   
   FYI @nevi-me @andygrove @alamb 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to