Andrew Lamb created ARROW-10945:
-----------------------------------
Summary: [Rust] [DataFusion] Allow User Defined Aggregates to
return multiple values / structs
Key: ARROW-10945
URL: https://issues.apache.org/jira/browse/ARROW-10945
Project: Apache Arrow
Issue Type: New Feature
Reporter: Andrew Lamb
Use case:
I want to implement a user defined aggregate function that produces more than
one column (more than one logical value).
Specifically I am trying to implement the InfluxDB 'selector' functions
`first`, `last`, `min`, and `max` as DataFusion aggregate functions.
I can't use the built in aggregate functions in DataFusion as selector
functions aren't exactly like normal aggregate functions -- they return both
the actual aggregate value as well as a timestamp. In addition, `first` and
`last` pick a row in the value column based on the value in the timestamp
column.
After some investigation, I realize I can't elegantly use the built in user
defined aggregate framework in DataFusion either. As an example of what is
going on here, let's take
```
value | time
------+------
3 | 1000
2 | 2000
1 | 3000
```
The result of `last(value)` should be two columns `1 | 3000` -- however,
modeling this as a DataFusion aggregate does not seem to be possible at this
time. Each aggregate function can return a single columnar value but we need
to return 2 (the `.value` and `.time` fields).
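To make the desired semantics concrete, here is a minimal sketch in plain Rust. It does not use any DataFusion types: `Selected`, `LastSelector`, and `last_of` are hypothetical names for illustration only, and both values and timestamps are assumed to be `i64`.

```rust
/// Output of a selector: the chosen value plus the timestamp of its row.
/// (Hypothetical type, not part of DataFusion.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Selected {
    pub value: i64,
    pub time: i64,
}

/// Hypothetical accumulator for `last(value)`.
#[derive(Debug, Default)]
pub struct LastSelector {
    current: Option<Selected>,
}

impl LastSelector {
    /// Feed one (value, time) row. The row is kept only if its timestamp is
    /// strictly later than the one seen so far, so ties keep the earlier row.
    pub fn update(&mut self, value: i64, time: i64) {
        match self.current {
            Some(s) if s.time >= time => {}
            _ => self.current = Some(Selected { value, time }),
        }
    }

    /// Returns both fields together; a single scalar cannot express this.
    pub fn evaluate(&self) -> Option<Selected> {
        self.current
    }
}

/// Convenience helper: run the selector over a slice of (value, time) rows.
pub fn last_of(rows: &[(i64, i64)]) -> Option<Selected> {
    let mut acc = LastSelector::default();
    for &(value, time) in rows {
        acc.update(value, time);
    }
    acc.evaluate()
}
```

For the table above, `last_of(&[(3, 1000), (2, 2000), (1, 3000)])` yields a `Selected` with `value` 1 and `time` 3000. The point of the sketch is that `evaluate` has to hand back two fields at once, which is what a struct-valued aggregate result would allow.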
Ideally I was thinking that the UDF could produce a Struct (with named fields
`value` and `time`), but the evaluate function
([code](https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/mod.rs#L238))
returns a `ScalarValue`, and at the moment `ScalarValue` [doesn't have support
for Structs](https://github.com/apache/arrow/blob/master/rust/datafusion/src/scalar.rs#L44).
I suspect that we would also need to add support in DataFusion for selecting
fields from structs.
See additional detail and context on
https://github.com/influxdata/influxdb_iox/issues/448#issuecomment-744601824