[I] Unexpected results with group by and random() [arrow-datafusion]

via GitHub Thu, 19 Oct 2023 18:23:40 -0700


Blajda opened a new issue, #7876:
URL: https://github.com/apache/arrow-datafusion/issues/7876


   ### Describe the bug
   
   I have a table `t1` with a column called `file_path`.
   I want to obtain a list of unique `file_path` values and then take a random subset of them.
   I thought that this could be achieved with the following code.
   
   ```rust
     let files = ctx.sql("select file_path from t1 group by file_path").await.unwrap()
         .with_column("r", random() ).unwrap()
         .filter(col("r").lt_eq(lit(0.2))).unwrap();
     files.show().await.unwrap();
   ```
   
   However, the output of my query contains a record that should have been filtered out (row `A`, whose `r` value is about 0.80):
   ```
   | A                    | 0.8023022275259943   |
   | B                    | 0.05829777789599211  |
   | C                    | 0.14330028518553894  |
   ```
   
   This is the computed logical plan:
   ```
   Projection: t1.file_path, random() AS r
       Aggregate: groupBy=[[t1.file_path]], aggr=[[]]
           Filter: random() <= Float64(0.2) 
              TableScan: t1 projection=[file_path]
   ```
   
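   In this case I would expect the filter to be applied after the aggregate operation, not before it.
   As planned, the filter below the aggregate and the projection above it each call `random()` separately, so the `r` value shown in the output is not the value the filter tested.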
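
To make the difference between the two orderings concrete, here is a minimal sketch that does not use DataFusion at all. Everything in it is a hypothetical stand-in made up for illustration: `Rng` is a small xorshift generator standing in for a volatile `random()`, and `filter_after_projection` / `filter_before_projection` are illustrative names, not DataFusion APIs. It only assumes that each `random()` call in a plan is evaluated independently.

```rust
// Stand-in for a volatile `random()`: a tiny xorshift64 generator.
// This is NOT DataFusion's implementation, only a dependency-free substitute.
struct Rng(u64);

impl Rng {
    fn next_f64(&mut self) -> f64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        // Use the top 53 bits to build a value in [0, 1).
        (x >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Expected order: compute `r` once per row, then filter on that same value.
fn filter_after_projection(paths: &[&str], rng: &mut Rng) -> Vec<(String, f64)> {
    paths
        .iter()
        .map(|p| (p.to_string(), rng.next_f64()))
        .filter(|(_, r)| *r <= 0.2)
        .collect()
}

/// Order in the reported plan: the filter draws its own random value,
/// and the projection later draws a second, unrelated one for `r`.
fn filter_before_projection(paths: &[&str], rng: &mut Rng) -> Vec<(String, f64)> {
    let mut out = Vec::new();
    for p in paths {
        let filter_draw = rng.next_f64();
        if filter_draw <= 0.2 {
            // The value shown as `r` is a fresh draw, not `filter_draw`.
            out.push((p.to_string(), rng.next_f64()));
        }
    }
    out
}

fn main() {
    let paths = ["A", "B", "C", "D", "E", "F", "G", "H"];
    let mut rng = Rng(0x9E37_79B9_7F4A_7C15); // arbitrary non-zero seed
    println!("filter after projection:  {:?}", filter_after_projection(&paths, &mut rng));
    println!("filter before projection: {:?}", filter_before_projection(&paths, &mut rng));
}
```

In the first function every surviving row is guaranteed to have `r <= 0.2`. In the second there is no such guarantee, because the displayed `r` is unrelated to the value the filter tested. That is consistent with the output above.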
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

