Gentoli opened a new issue, #4030:
URL: https://github.com/apache/arrow-datafusion/issues/4030
**Describe the bug**
Filtering on the output of a ScalarUDF causes the UDF to be called twice.
**To Reproduce**
``` rust
ctx.register_csv("csv", "/test.csv", CsvReadOptions::new()).await.unwrap();
let udf = {
    create_udf(
        "rand_bool",
        vec![DataType::Float32],
        Arc::new(DataType::Boolean),
        Volatility::Stable,
        make_scalar_function(|a| {
            const BOOLS: [bool; 4] = [true, true, false, false];
            let x = a.first().unwrap();
            println!("udf in: {x:?}");
            Ok(Arc::new(BooleanArray::from(Vec::from(&BOOLS[..x.len()]))) as ArrayRef)
        }),
    )
};
ctx.register_udf(udf.clone());
let query = ctx.table("csv").unwrap()
    .select(vec![
        Expr::Wildcard,
        udf.call(vec![col("num")]).alias("rand"),
    ]).unwrap()
    .filter(col("rand").eq(lit(false))).unwrap();
query.show_limit(10).await.unwrap();
query.explain(false, false).unwrap().show().await.unwrap();
```
The same happens with SQL:
``` sql
SELECT * FROM (SELECT *, rand_bool(num) AS rand FROM csv) WHERE NOT rand
```
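The UDF is admittedly not really "stable". Regardless, it should not be called
twice (it prints `udf in: PrimitiveArray ...` twice). Because the second call
sees only the already-filtered rows, the returned rows can show `rand` as
`true` even though the filter keeps only rows where `rand` is false.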
<details>
<summary>Output + Test files</summary>
Output:
```
udf in: PrimitiveArray<Float32>
[
100.0,
200.0,
150.0,
300.0,
]
udf in: PrimitiveArray<Float32>
[
150.0,
300.0,
]
+--------+-----+------+
| name_1 | num | rand |
+--------+-----+------+
| andy | 150 | true |
| paul | 300 | true |
+--------+-----+------+
+---------------+--------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                   |
+---------------+--------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: csv.name_1, csv.num, rand_bool(CAST(csv.num AS Float32)) AS rand                           |
|               |   Filter: NOT rand_bool(CAST(csv.num AS Float32))                                                      |
|               |     TableScan: csv projection=[name_1, num], partial_filters=[NOT rand_bool(CAST(csv.num AS Float32))] |
| physical_plan | ProjectionExec: expr=[name_1@0 as name_1, num@1 as num, rand_bool(CAST(num@1 AS Float32)) as rand]     |
|               |   CoalesceBatchesExec: target_batch_size=4096                                                          |
|               |     FilterExec: NOT rand_bool(CAST(num@1 AS Float32))                                                  |
|               |       RepartitionExec: partitioning=RoundRobinBatch(12)                                                |
|               |         CsvExec: files=[<>/test.csv], has_header=true, limit=None, projection=[name_1, num]            |
|               |                                                                                                        |
+---------------+--------------------------------------------------------------------------------------------------------+
```
test.csv
``` csv
name_1,num
andrew,100
jorge,200
andy,150
paul,300
```
</details>
**Expected behavior**
The UDF should be evaluated once in the projection, and the filter should then
be applied to the projected `rand` column.
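The difference between the two evaluation orders can be sketched in plain Rust,
without DataFusion. This is only an illustration: `Row`, `filter_then_project`
and `project_then_filter` are hypothetical helpers, and `rand_bool` here is a
stand-in that mimics the UDF from the reproduction (its result depends on the
batch length and row position, not on the values).

``` rust
// Hypothetical row type for the illustration: (name_1, num).
type Row = (&'static str, f32);

// Stand-in for the UDF in the report: returns the first `len` entries of a
// fixed pattern, so the result depends on row position within the batch.
fn rand_bool(batch: &[f32]) -> Vec<bool> {
    const BOOLS: [bool; 4] = [true, true, false, false];
    BOOLS[..batch.len()].to_vec()
}

// Order seen in the reported plan: the filter evaluates the UDF, then the
// projection evaluates it again on the already-filtered batch.
fn filter_then_project(rows: &[Row]) -> Vec<(&'static str, f32, bool)> {
    let nums: Vec<f32> = rows.iter().map(|r| r.1).collect();
    let mask = rand_bool(&nums); // first call (filter)
    let kept: Vec<Row> = rows
        .iter()
        .zip(&mask)
        .filter(|(_, m)| !**m) // keep rows where the UDF returned false
        .map(|(r, _)| *r)
        .collect();
    let kept_nums: Vec<f32> = kept.iter().map(|r| r.1).collect();
    let rand = rand_bool(&kept_nums); // second call (projection)
    kept.iter().zip(rand).map(|(r, b)| (r.0, r.1, b)).collect()
}

// Expected order: evaluate the UDF once, then filter on the projected column.
fn project_then_filter(rows: &[Row]) -> Vec<(&'static str, f32, bool)> {
    let nums: Vec<f32> = rows.iter().map(|r| r.1).collect();
    let rand = rand_bool(&nums); // only call
    rows.iter()
        .zip(rand)
        .filter(|(_, b)| !*b)
        .map(|(r, b)| (r.0, r.1, b))
        .collect()
}

fn main() {
    // Same data as test.csv in the report.
    let rows: [Row; 4] = [
        ("andrew", 100.0),
        ("jorge", 200.0),
        ("andy", 150.0),
        ("paul", 300.0),
    ];
    // andy and paul come back with rand = true, as in the reported output.
    println!("{:?}", filter_then_project(&rows));
    // andy and paul come back with rand = false, which is what the filter asks for.
    println!("{:?}", project_then_filter(&rows));
}
```

In both orders the same two rows survive the filter; only the value shown in
`rand` differs, because the second UDF call sees a two-row batch.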
**Additional context**
Running @ master (d391b859c44e1c366eb4da5e8cabd199336f4243)