rluvaton commented on PR #18921:
URL: https://github.com/apache/datafusion/pull/18921#issuecomment-3944235156
Thanks a lot for this PR.
A couple of things we should make sure we either support now or can add in the
future without breaking changes:
1. Support `Map`, `(Large)List`, `(Large)ListView`, and `FixedSizeList` as the
input for a lambda.
2. Multiple lambdas in a single expression, for example
`map_key_value(some_map_col, map_key_lambda, map_value_lambda)`, where each
lambda gets different variables.
3. Lambda expressions that access columns that are not in the list itself,
so I can do the following:
```
| year | grades |
|------|----------------|
| 1998 | [1, 2, 3] |
| 1999 | [4, 99, 5, 10] |
| 2000 | [6, 0, null] |
```
`array_transform(grades, x -> if year <= 1990 then x * 10 else x)`
4. Optional arguments for the lambda, for example the index of the item in the
list.
The "optional" part is important, as I want to avoid creating that input if
I don't need it.
5. Nested lambda expressions: `array_transform(matrix, x ->
array_transform(x, y -> y * 2))`
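To make point 3 concrete, here is a minimal plain-Rust sketch of the semantics I have in mind. It is not DataFusion code: `array_transform_with_capture` is a hypothetical name, and the columns are modeled as plain slices rather than Arrow arrays. The only point is that the lambda sees the row's `year` alongside each list element.

```rust
/// Toy model of `array_transform` where the lambda also receives a value
/// captured from another column of the same row (here `year`).
/// Hypothetical helper, not part of DataFusion.
fn array_transform_with_capture<F>(
    years: &[i32],
    grades: &[Vec<Option<i64>>],
    lambda: F,
) -> Vec<Vec<Option<i64>>>
where
    F: Fn(i32, Option<i64>) -> Option<i64>,
{
    years
        .iter()
        .zip(grades)
        // Each row pairs one `year` with one list; the lambda runs per element.
        .map(|(&year, list)| list.iter().map(|&x| lambda(year, x)).collect())
        .collect()
}

fn main() {
    let years = [1998, 1999, 2000];
    let grades = vec![
        vec![Some(1), Some(2), Some(3)],
        vec![Some(4), Some(99), Some(5), Some(10)],
        vec![Some(6), Some(0), None],
    ];
    // x -> if year <= 1990 then x * 10 else x (nulls stay null)
    let out = array_transform_with_capture(&years, &grades, |year, x| {
        x.map(|v| if year <= 1990 { v * 10 } else { v })
    });
    // Every year in the table is after 1990, so the lists come back unchanged.
    assert_eq!(out, grades);
    println!("{:?}", out);
}
```

In a real columnar implementation the captured column would presumably have to be expanded to line up with the flattened list values, which is part of why I'd like this to be planned for up front.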
There is also some stuff that every lambda expression would need, for which we
would need to provide helpers (rather than always fixing up the input, IMO, as
that would be expensive, and the user might have prior knowledge about the
input, want their own implementation, or know that the child lambda expression
can't error):
> (I have helpers for all of these)
1. How we handle null lists when the underlying values are not empty and the
expression can fail, for example: `array_transform(list, x -> 1 / x)`.
Take this input, where the second list is `null` but its underlying values are
`[0, 3]`: if we run the transform on those values, it will fail with division
by zero.
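```rust
use std::sync::Arc;

use arrow::array::{GenericListArray, Int8Array};
use arrow::buffer::{NullBuffer, OffsetBuffer};
use arrow::datatypes::{DataType, Field};

/// Three lists: `[1, 2]`, `null` (still backed by `[0, 3]`), and `[4]`.
fn get_list() -> GenericListArray<i32> {
    GenericListArray::new(
        Arc::new(Field::new_list_field(DataType::Int8, false)),
        OffsetBuffer::<i32>::from_lengths(vec![2, 2, 1]),
        Arc::new(Int8Array::from(vec![1, 2, 0, 3, 4])),
        Some(NullBuffer::from(vec![true, false, true])),
    )
}
```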
I have a lot of helpers to clean up the nulls, BTW.
2. How we handle sliced lists: the child should only work on the sliced data.
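As a rough illustration of both points, here is a plain-Rust sketch that mirrors the layout of the `get_list` input (offsets, flat child values, row validity). `ToyList` and `transform_valid_rows` are hypothetical names used only for this example, not Arrow or DataFusion APIs, and integer division stands in for `1 / x`.

```rust
/// Toy stand-in for a list array: offsets, flat child values, row validity.
/// This is NOT arrow's `GenericListArray`; it only mirrors its layout.
struct ToyList {
    offsets: Vec<usize>,
    values: Vec<i8>,
    valid: Vec<bool>,
}

/// Apply a fallible lambda only to child values that belong to valid rows.
/// Values hidden under a null row (or outside the offsets) are copied
/// through untouched, so the lambda never sees them.
fn transform_valid_rows<F>(list: &ToyList, lambda: F) -> Result<Vec<i8>, String>
where
    F: Fn(i8) -> Result<i8, String>,
{
    let mut out = list.values.clone();
    for (row, is_valid) in list.valid.iter().enumerate() {
        if !*is_valid {
            continue;
        }
        for i in list.offsets[row]..list.offsets[row + 1] {
            out[i] = lambda(list.values[i])?;
        }
    }
    Ok(out)
}

/// x -> 1 / x, failing on zero instead of panicking.
fn one_over(x: i8) -> Result<i8, String> {
    1i8.checked_div(x).ok_or_else(|| "division by zero".to_string())
}

fn main() {
    // Rows: [1, 2], null (hiding [0, 3]), [4]
    let list = ToyList {
        offsets: vec![0, 2, 4, 5],
        values: vec![1, 2, 0, 3, 4],
        valid: vec![true, false, true],
    };
    // Naively running the lambda over every child value hits the hidden 0.
    let naive: Result<Vec<i8>, String> =
        list.values.iter().map(|&x| one_over(x)).collect();
    assert!(naive.is_err());
    // Skipping null rows avoids the spurious error.
    let masked = transform_valid_rows(&list, one_over).unwrap();
    assert_eq!(masked, vec![1, 0, 0, 3, 0]);
}
```

Because the loop is driven by the offsets, the same approach covers a sliced list whose offsets do not start at 0: only values inside `offsets[row]..offsets[row + 1]` are ever handed to the lambda.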
-----
I think having a new `LambdaUDFImpl` is better than adding functions to the
existing `ScalarUDF` because:
1. The `ScalarUDF` trait will not grow too much, which keeps implementing
regular scalar UDFs easy and avoids making lambda UDFs overwhelming.
2. If we need to add a required function that only applies to lambdas, we can
add it to the new trait with ease, and we won't need to do weird stuff
to avoid breaking changes.
3. Less ambiguity on the API.
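A toy sketch of the separation I mean, in plain Rust. Everything here (`ToyScalarUdf`, `ToyLambdaUdf`, `lambda_arity`) is hypothetical and deliberately simplified; it is not the trait proposed in this PR nor any existing DataFusion API. It only shows that a lambda-only required method can live on its own trait without touching scalar UDF implementors.

```rust
/// Hypothetical stand-in for a regular scalar UDF trait (not DataFusion's).
trait ToyScalarUdf {
    fn name(&self) -> &str;
    fn invoke(&self, args: &[i64]) -> i64;
}

/// Hypothetical separate trait for functions that take lambdas.
trait ToyLambdaUdf {
    fn name(&self) -> &str;
    /// Lambda-only requirement: how many parameters each lambda takes.
    /// Adding this here does not affect any `ToyScalarUdf` implementor.
    fn lambda_arity(&self) -> Vec<usize>;
    fn invoke(&self, list: &[i64], lambdas: &[&dyn Fn(&[i64]) -> i64]) -> Vec<i64>;
}

struct Abs;

impl ToyScalarUdf for Abs {
    fn name(&self) -> &str {
        "abs"
    }
    fn invoke(&self, args: &[i64]) -> i64 {
        args[0].abs()
    }
}

struct ArrayTransform;

impl ToyLambdaUdf for ArrayTransform {
    fn name(&self) -> &str {
        "array_transform"
    }
    fn lambda_arity(&self) -> Vec<usize> {
        vec![1]
    }
    fn invoke(&self, list: &[i64], lambdas: &[&dyn Fn(&[i64]) -> i64]) -> Vec<i64> {
        // Call the single lambda once per element, passing it as a 1-item slice.
        list.iter()
            .map(|x| (lambdas[0])(std::slice::from_ref(x)))
            .collect()
    }
}

fn main() {
    let double = |args: &[i64]| args[0] * 2;
    let out = ArrayTransform.invoke(&[1, 2, 3], &[&double]);
    assert_eq!(out, vec![2, 4, 6]);
    assert_eq!(ArrayTransform.lambda_arity(), vec![1]);
    assert_eq!(Abs.invoke(&[-5]), 5);
    println!("{} and {} both work", Abs.name(), ArrayTransform.name());
}
```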
----
I want to keep the simplicity of `ScalarUDF`, which means that to evaluate a
lambda expression I shouldn't need to construct anything; I only need to
provide the input and maybe some options for future use.