tustvold commented on issue #6981:
URL: https://github.com/apache/arrow-datafusion/issues/6981#issuecomment-1791105117
In response to a question asked in the ASF Slack, here is my best attempt to
explain what is going on here.
---
# Scalars vs Arrays
First, in order to understand what is going on, we need to understand the
relationship between SQL and arrow, and in particular how scalars fit in.
The arrow data model is concerned with arrays; these can be thought of as
columns in a tabular data model.
So an Int64Array might contain the data `[1, 2, 3, 4]`.
If you then wrote a SQL expression like
```
SQL> col_a + col_b
```
This will compute a new array `col_c` where `col_c[i] = col_a[i] + col_b[i]`.
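To make this concrete, the elementwise case looks something like the following in arrow-rs (a rough sketch; kernel module paths can differ between versions):
```
use arrow::array::Int64Array;
use arrow::compute::kernels::numeric::add;

let col_a = Int64Array::from(vec![1, 2, 3, 4]);
let col_b = Int64Array::from(vec![10, 20, 30, 40]);

// col_c[i] = col_a[i] + col_b[i], i.e. [11, 22, 33, 44]
let col_c = add(&col_a, &col_b).unwrap();
```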
Now when writing SQL queries it is common to want to write something like
```
SQL> col_a + 3
```
In this case you want `col_c[i] = col_a[i] + 3`, where `3` is a scalar value.
But arrow only stores arrays, not scalars, so what do you do?
DF uses a `ColumnarValue` enumeration which contains either an array or a
`ScalarValue`.
Upstream arrow-rs has a similar concept in the form of a trait called `Datum`.
Some kernels are able to optimise the case where one or the other side is
known to be a scalar, and so explicitly handle these special types.
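For example, in arrow-rs a `Scalar` wraps a one-element array and can be passed to any `Datum`-aware kernel, so `col_a + 3` can be computed without materialising `[3, 3, 3, 3]` (again a rough sketch):
```
use arrow::array::{Int64Array, Scalar};
use arrow::compute::kernels::numeric::add;

let col_a = Int64Array::from(vec![1, 2, 3, 4]);
// A "scalar" is represented as a one-element array
let three = Scalar::new(Int64Array::from(vec![3]));

// The kernel sees the right-hand side is a scalar and need not
// materialise it as a full-length array
let col_c = add(&col_a, &three).unwrap();
```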
For other kernels, DF will expand the scalar value to an array.
In particular consider the expression
```
SQL> operation(a, 3)
```
Say operation only handles arrays, and `a` contains `[1, 2, 3, 4]`.
DF will call the underlying kernel with `operation([1, 2, 3, 4], [3, 3, 3, 3])`.
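In DataFusion this expansion might look something like the following sketch; note the exact signature of `ScalarValue::to_array_of_size` has varied across versions, and `ColumnarValue` also provides an `into_array` helper for this:
```
use datafusion::arrow::array::ArrayRef;
use datafusion::common::ScalarValue;
use datafusion::physical_plan::ColumnarValue;

/// Expand a ColumnarValue into an array of `num_rows` elements so that a
/// kernel which only understands arrays can be invoked
fn to_array(value: ColumnarValue, num_rows: usize) -> ArrayRef {
    match value {
        ColumnarValue::Array(array) => array,
        // Repeat the scalar num_rows times, e.g. 3 -> [3, 3, 3, 3]
        ColumnarValue::Scalar(scalar) => scalar.to_array_of_size(num_rows),
    }
}
```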
---
# ListArray Union
Now we can get back to talking about `array_union`.
Consider the example SQL
```
SQL> array_union([1, 2, 3, 4], [5, 6, 7, 8])
```
Both arguments are scalars, and so DF will expand them to arrays as above;
in this case each expands to an array with just a single element, because
both arguments are scalar.
So the arguments to the `array_union` kernel are actually 2-dimensional, i.e.
```
RUST> array_union([[1, 2, 3, 4]], [[5, 6, 7, 8]])
```
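Each argument is therefore a one-row ListArray; constructing one directly in arrow-rs looks something like this (a sketch):
```
use arrow::array::{Array, ListArray};
use arrow::datatypes::Int64Type;

// A ListArray with a single element: the list [1, 2, 3, 4].
// Internally it stores a flat child array [1, 2, 3, 4] and offsets [0, 4]
let l = ListArray::from_iter_primitive::<Int64Type, _, _>(vec![Some(vec![
    Some(1), Some(2), Some(3), Some(4),
])]);

assert_eq!(l.len(), 1); // one list element
assert_eq!(l.value_offsets(), &[0, 4]);
assert_eq!(l.values().len(), 4); // the flat child values
```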
The kernel then wants to perform the union operation for each element of the
input list arrays, in this case lists of integers:
```
let l = l.as_list::<OffsetSize>();
let r = r.as_list::<OffsetSize>();
let mut out = ...; // some builder for the output list array
for (l_value, r_value) in l.iter().zip(r.iter()) {
    out.push(union(l_value, r_value));
}
```
Now this is where it gets tricky: not only do we need to deduplicate the
list values, but we also need to incrementally build an output list array with
the results.
One way this might be achieved is something like the following (not tested at all):
```
let l = l.as_list::<OffsetSize>();
let r = r.as_list::<OffsetSize>();

// Convert the list child values to rows, so values of any type can be
// hashed and compared as opaque byte sequences
let converter = RowConverter::new(vec![SortField::new(l.value_type())]).unwrap();
let l_values = converter.convert_columns(&[l.values().clone()]).unwrap();
let r_values = converter.convert_columns(&[r.values().clone()]).unwrap();

// Might be worth adding an upstream OffsetBufferBuilder
let mut offsets = Vec::<OffsetSize>::with_capacity(l.len() + 1);
offsets.push(OffsetSize::usize_as(0));

let mut dedup = HashSet::new();
let mut rows = Vec::with_capacity(l_values.num_rows() + r_values.num_rows());
for (l_w, r_w) in l.offsets().windows(2).zip(r.offsets().windows(2)) {
    // The slice of `l_values` comprising the current list element
    for i in l_w[0].as_usize()..l_w[1].as_usize() {
        dedup.insert(l_values.row(i));
    }
    // The slice of `r_values` comprising the current list element
    for i in r_w[0].as_usize()..r_w[1].as_usize() {
        dedup.insert(r_values.row(i));
    }
    // Flush the deduplicated rows for this element and record its end offset
    rows.extend(dedup.drain());
    offsets.push(OffsetSize::usize_as(rows.len()));
}

// Convert the deduplicated rows back into a (single-column) values array
let values = converter.convert_rows(rows).unwrap().swap_remove(0);
let field = Arc::new(Field::new("item", l.value_type(), true));
let offsets = OffsetBuffer::new(offsets.into());
let nulls = NullBuffer::union(l.nulls(), r.nulls());
Ok(GenericListArray::<OffsetSize>::new(field, offsets, values, nulls))
```
This relies on the particulars of how ListArray is represented, which is
explained in a bit more detail
[here](https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/).
Unfortunately the operation being performed here is not a natural fit for a
selection kernel, because it selects particular groups of values from the list
children, and so there isn't really a way to avoid dealing with the reality of
how these arrays are encoded.
Dealing with the offsets and values directly avoids fighting the borrow
checker over the lifetime of the rows, and avoids downcasting for each list
element.
Additionally, using `RowConverter` means this will generalise to lists of any
value type, including lists or structs, whilst avoiding having to generate
specialized code for each value type, which is a problematic pattern
(https://github.com/apache/arrow-datafusion/issues/7988).
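To illustrate why the row encoding enables this, equal values encode to equal byte sequences, so rows can be hashed and compared directly regardless of the underlying value type (a rough sketch):
```
use std::sync::Arc;
use arrow::array::{ArrayRef, StringArray};
use arrow::datatypes::DataType;
use arrow::row::{RowConverter, SortField};

let converter = RowConverter::new(vec![SortField::new(DataType::Utf8)]).unwrap();
let values: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "a"]));
let rows = converter.convert_columns(&[values]).unwrap();

// Equal values encode to equal rows, so they hash and compare directly
assert!(rows.row(0) == rows.row(2));
assert!(rows.row(0) != rows.row(1));
```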
I hope this at least clarifies some things...