tustvold commented on issue #6981:
URL: https://github.com/apache/arrow-datafusion/issues/6981#issuecomment-1791105117
In response to a question asked in the ASF Slack, here is my best attempt to
explain what is going on here.
---
# Scalars vs Arrays
First, in order to understand what is going on, we need to understand the
relationship between SQL and arrow, and in particular how scalars fit in.
The arrow data model is concerned with arrays; these can be thought of as
columns in a tabular data model.
So an Int64Array might contain the data `[1, 2, 3, 4]`.
If you then wrote a SQL expression like
```
SQL> col_a + col_b
```
This will compute a new array `col_c` where `col_c[i] = col_a[i] + col_b[i]`.
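To make this concrete, the elementwise case looks something like the following in arrow-rs (a rough sketch; kernel module paths can differ between versions):
```
use arrow::array::Int64Array;
use arrow::compute::kernels::numeric::add;

let col_a = Int64Array::from(vec![1, 2, 3, 4]);
let col_b = Int64Array::from(vec![10, 20, 30, 40]);

// col_c[i] = col_a[i] + col_b[i], i.e. [11, 22, 33, 44]
let col_c = add(&col_a, &col_b).unwrap();
```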
Now when writing SQL queries it is common to want to write something like
```
SQL> col_a + 3
```
In this case you want `col_c[i] = col_a[i] + 3`, where `3` is a scalar value.
But arrow only stores arrays, not scalars, so what do you do?
DF uses a `ColumnarValue` enumeration which contains either an array or a
`ScalarValue`.
Upstream arrow-rs has a similar concept in the form of a trait called `Datum`.
Some kernels are able to optimise the case where one or the other side is
known to be a scalar, and so explicitly handle these special types.
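For example, in arrow-rs a `Scalar` wraps a one-element array and can be passed to any `Datum`-aware kernel, so `col_a + 3` can be computed without materialising `[3, 3, 3, 3]` (again a rough sketch):
```
use arrow::array::{Int64Array, Scalar};
use arrow::compute::kernels::numeric::add;

let col_a = Int64Array::from(vec![1, 2, 3, 4]);
// A "scalar" is represented as a one-element array
let three = Scalar::new(Int64Array::from(vec![3]));

// The kernel sees the right-hand side is a scalar and need not
// materialise it as a full-length array
let col_c = add(&col_a, &three).unwrap();
```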
For other kernels, DF will expand the scalar value to an array.
In particular consider the expression
```
SQL> operation(a, 3)
```
Say operation only handles arrays, and `a` contains `[1, 2, 3, 4]`.
DF will call the underlying kernel with `operation([1, 2, 3, 4], [3, 3, 3, 3])`.
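In DataFusion this expansion might look something like the following sketch; note the exact signature of `ScalarValue::to_array_of_size` has varied across versions, and `ColumnarValue` also provides an `into_array` helper for this:
```
use datafusion::arrow::array::ArrayRef;
use datafusion::common::ScalarValue;
use datafusion::physical_plan::ColumnarValue;

/// Expand a ColumnarValue into an array of `num_rows` elements so that a
/// kernel which only understands arrays can be invoked
fn to_array(value: ColumnarValue, num_rows: usize) -> ArrayRef {
    match value {
        ColumnarValue::Array(array) => array,
        // Repeat the scalar num_rows times, e.g. 3 -> [3, 3, 3, 3]
        ColumnarValue::Scalar(scalar) => scalar.to_array_of_size(num_rows),
    }
}
```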
---
# ListArray Union
Now we can get back to talking about `array_union`.
Consider the example SQL
```
SQL> array_union([1, 2, 3, 4], [5, 6, 7, 8])
```
Both arguments are scalars, and so DF will expand them to arrays as above;
in this case each expands to an array with just a single element, because
both arguments are scalar.
So the arguments to the `array_union` kernel are actually 2-dimensional, i.e.
```
RUST> array_union([[1, 2, 3, 4]], [[5, 6, 7, 8]])
```
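Each argument is therefore a one-row ListArray; constructing one directly in arrow-rs looks something like this (a sketch):
```
use arrow::array::{Array, ListArray};
use arrow::datatypes::Int64Type;

// A ListArray with a single element: the list [1, 2, 3, 4].
// Internally it stores a flat child array [1, 2, 3, 4] and offsets [0, 4]
let l = ListArray::from_iter_primitive::<Int64Type, _, _>(vec![Some(vec![
    Some(1), Some(2), Some(3), Some(4),
])]);

assert_eq!(l.len(), 1); // one list element
assert_eq!(l.value_offsets(), &[0, 4]);
assert_eq!(l.values().len(), 4); // the flat child values
```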
The kernel then wants to perform the union operation for each element of the
input list arrays, in this case lists of integers:
```
let l = l.as_list::<OffsetSize>();
let r = r.as_list::<OffsetSize>();
let mut out = ...; // some builder for the output list array
for (l_value, r_value) in l.iter().zip(r.iter()) {
    out.push(union(l_value, r_value));
}
```
Now this is where it gets tricky: not only do we need to deduplicate the
list values, but we also need to incrementally build an output list array with
the results.
One way this might be achieved is something like the following (not tested at all):
```
let l = l.as_list::<OffsetSize>();
let r = r.as_list::<OffsetSize>();

// Convert the list child values to rows, so values of any type can be
// hashed and compared as opaque byte sequences
let converter = RowConverter::new(vec![SortField::new(l.value_type())]).unwrap();
let l_values = converter.convert_columns(&[l.values().clone()]).unwrap();
let r_values = converter.convert_columns(&[r.values().clone()]).unwrap();

// Might be worth adding an upstream OffsetBufferBuilder
let mut offsets = Vec::<OffsetSize>::with_capacity(l.len() + 1);
offsets.push(OffsetSize::usize_as(0));

let mut dedup = HashSet::new();
let mut rows = Vec::with_capacity(l_values.num_rows() + r_values.num_rows());
for (l_w, r_w) in l.offsets().windows(2).zip(r.offsets().windows(2)) {
    // The slice of `l_values` comprising the current list element
    for i in l_w[0].as_usize()..l_w[1].as_usize() {
        dedup.insert(l_values.row(i));
    }
    // The slice of `r_values` comprising the current list element
    for i in r_w[0].as_usize()..r_w[1].as_usize() {
        dedup.insert(r_values.row(i));
    }
    // Flush the deduplicated rows for this element and record its end offset
    rows.extend(dedup.drain());
    offsets.push(OffsetSize::usize_as(rows.len()));
}

// Convert the deduplicated rows back into a (single-column) values array
let values = converter.convert_rows(rows).unwrap().swap_remove(0);
let field = Arc::new(Field::new("item", l.value_type(), true));
let offsets = OffsetBuffer::new(offsets.into());
let nulls = NullBuffer::union(l.nulls(), r.nulls());
Ok(GenericListArray::<OffsetSize>::new(field, offsets, values, nulls))
```
This relies on the particulars of how ListArray is represented, which is
explained in a bit more detail
[here](https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/).
Unfortunately the operation being performed here is not a natural fit for a
selection kernel, because it selects particular groups of values from the list
children, and so there isn't really a way to avoid dealing with the reality of
how these arrays are encoded.
Dealing with the offsets and values directly avoids fighting the borrow
checker over the lifetime of the rows, and avoids downcasting for each list
element.
Additionally, using `RowConverter` means this will generalise to lists of any
value type, including lists or structs, whilst avoiding having to generate
specialized code for each value type, which is a problematic pattern
(https://github.com/apache/arrow-datafusion/issues/7988).
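To illustrate why the row encoding enables this, equal values encode to equal byte sequences, so rows can be hashed and compared directly regardless of the underlying value type (a rough sketch):
```
use std::sync::Arc;
use arrow::array::{ArrayRef, StringArray};
use arrow::datatypes::DataType;
use arrow::row::{RowConverter, SortField};

let converter = RowConverter::new(vec![SortField::new(DataType::Utf8)]).unwrap();
let values: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "a"]));
let rows = converter.convert_columns(&[values]).unwrap();

// Equal values encode to equal rows, so they hash and compare directly
assert!(rows.row(0) == rows.row(2));
assert!(rows.row(0) != rows.row(1));
```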
I hope this at least clarifies some things...