[GitHub] [arrow-datafusion] Gentoli opened a new issue, #4030: ScalarUDF called twice when using filter on UDF column

GitBox Sun, 30 Oct 2022 06:35:57 -0700


Gentoli opened a new issue, #4030:
URL: https://github.com/apache/arrow-datafusion/issues/4030


   **Describe the bug**
   Filtering on a column produced by a ScalarUDF causes the UDF to be called twice.
   
   **To Reproduce**
   
   ``` rust
   
   ctx.register_csv("csv", "/test.csv", CsvReadOptions::new()).await.unwrap();
   
   
   let udf = {
       create_udf(
           "rand_bool",
           vec![DataType::Float32],
           Arc::new(DataType::Boolean),
           Volatility::Stable,
           make_scalar_function(|a| {
               const BOOLS: [bool; 4] = [true, true, false, false];
   
               let x = a.first().unwrap();
               println!("udf in: {x:?}");
   
               Ok(Arc::new(BooleanArray::from(Vec::from(&BOOLS[..x.len()]))) as ArrayRef)
           }),
       )
   };
   
   ctx.register_udf(udf.clone());
   
   let query = ctx.table("csv").unwrap()
       .select(vec![
           Expr::Wildcard,
           udf.call(vec![col("num")]).alias("rand"),
       ]).unwrap()
       .filter(col("rand").eq(lit(false))).unwrap();
   
   
   query.show_limit(10).await.unwrap();
   query.explain(false, false).unwrap().show().await.unwrap();
   
   ```
   The same happens with SQL:
   ``` sql
   SELECT * FROM (SELECT *, rand_bool(num) AS rand FROM csv) WHERE NOT rand
   ```
   
   The UDF is admittedly not really "stable". Regardless, it should not be called twice (it prints `udf in: PrimitiveArray ...` twice), and the output can contain rows where `rand` is `true` even though the filter keeps only rows where it is false.
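To make the mismatch concrete, here is a minimal plain-Rust model of the evaluation order described above. It does not use any DataFusion API; `filter_then_project` is a hypothetical helper name, and `rand_bool` is a simplified stand-in for the UDF in the reproduction. It assumes the filter evaluates the UDF on the full batch and the projection evaluates it again on the rows that survive.

```rust
/// Stand-in for the issue's `rand_bool` UDF (hypothetical simplification):
/// the output depends only on how many rows are in the batch.
fn rand_bool(batch: &[f32]) -> Vec<bool> {
    const BOOLS: [bool; 4] = [true, true, false, false];
    BOOLS[..batch.len()].to_vec()
}

/// Models the reported behavior: the UDF runs once for the filter and a
/// second time for the projection, on a smaller batch.
fn filter_then_project(nums: &[f32]) -> Vec<(f32, bool)> {
    // First call: keep only rows where `NOT rand_bool(num)` holds.
    let mask = rand_bool(nums);
    let kept: Vec<f32> = nums
        .iter()
        .zip(mask)
        .filter_map(|(&n, m)| if m { None } else { Some(n) })
        .collect();

    // Second call: recompute `rand` on the surviving rows only.
    let rand = rand_bool(&kept);
    kept.into_iter().zip(rand).collect()
}

fn main() {
    let nums = [100.0, 200.0, 150.0, 300.0];
    for (num, rand) in filter_then_project(&nums) {
        println!("num={num} rand={rand}");
    }
}
```

In this model the rows with `num` 150 and 300 pass the filter, but the second call sees a two-row batch and labels both of them `true`, which matches the symptom reported here.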
   
   <details>
     <summary>Output + Test files</summary>
   
   Output:
   ```
   udf in: PrimitiveArray<Float32>
   [
     100.0,
     200.0,
     150.0,
     300.0,
   ]
   udf in: PrimitiveArray<Float32>
   [
     150.0,
     300.0,
   ]
   +--------+-----+------+
   | name_1 | num | rand |
   +--------+-----+------+
   | andy   | 150 | true |
   | paul   | 300 | true |
   +--------+-----+------+
   
   +---------------+--------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                   |
   +---------------+--------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: csv.name_1, csv.num, rand_bool(CAST(csv.num AS Float32)) AS rand                           |
   |               |   Filter: NOT rand_bool(CAST(csv.num AS Float32))                                                      |
   |               |     TableScan: csv projection=[name_1, num], partial_filters=[NOT rand_bool(CAST(csv.num AS Float32))] |
   | physical_plan | ProjectionExec: expr=[name_1@0 as name_1, num@1 as num, rand_bool(CAST(num@1 AS Float32)) as rand]     |
   |               |   CoalesceBatchesExec: target_batch_size=4096                                                          |
   |               |     FilterExec: NOT rand_bool(CAST(num@1 AS Float32))                                                  |
   |               |       RepartitionExec: partitioning=RoundRobinBatch(12)                                                |
   |               |         CsvExec: files=[<>/test.csv], has_header=true, limit=None, projection=[name_1, num]            |
   |               |                                                                                                        |
   +---------------+--------------------------------------------------------------------------------------------------------+
   
   ```
   
   test.csv
   ``` csv
   name_1,num
   andrew,100
   jorge,200
   andy,150
   paul,300
   ```
   
   </details>
   
   **Expected behavior**
   The UDF should be projected first, and the filter should then be applied to the projected column.
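As a rough sketch of the expected order, the following plain-Rust model evaluates the UDF once and filters on the stored result. It is not DataFusion code; `project_then_filter` is a hypothetical helper name, and `rand_bool` is a simplified stand-in for the UDF in the reproduction.

```rust
/// Stand-in for the issue's `rand_bool` UDF (hypothetical simplification):
/// the output depends only on how many rows are in the batch.
fn rand_bool(batch: &[f32]) -> Vec<bool> {
    const BOOLS: [bool; 4] = [true, true, false, false];
    BOOLS[..batch.len()].to_vec()
}

/// Models the expected behavior: compute `rand` once per batch, then
/// filter on that already computed column.
fn project_then_filter(nums: &[f32]) -> Vec<(f32, bool)> {
    // Single UDF call for the whole batch.
    let rand = rand_bool(nums);
    nums.iter()
        .copied()
        .zip(rand)
        // Keep rows where `rand` is false, reusing the projected value.
        .filter(|&(_, r)| !r)
        .collect()
}

fn main() {
    let nums = [100.0, 200.0, 150.0, 300.0];
    for (num, rand) in project_then_filter(&nums) {
        println!("num={num} rand={rand}");
    }
}
```

With this ordering every returned row carries the same `rand` value the filter saw, so the rows for 150 and 300 come back with `rand` equal to `false`.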
   
   **Additional context**
   Running @ master (d391b859c44e1c366eb4da5e8cabd199336f4243)


