jorgecarleitao opened a new issue #523:
URL: https://github.com/apache/arrow-datafusion/issues/523


   When running group bys on small datasets, we are emitting too many record 
batches. This is a regression relative to ref 2423ff0d.
   
   This is causing the tests for the Python bindings to fail when upgrading to 
321fda40.
   
   For example,
   
   ```
   import pyarrow

   # `ctx` is the DataFusion execution context used by the test
   batch = pyarrow.RecordBatch.from_arrays(
       [pyarrow.array([1, 2, 3]), pyarrow.array([4, 4, 6])],
       names=["a", "b"],
   )
   df = ctx.create_dataframe([[batch]])
   ```
   
   with 
   
   ```
   df = df.aggregate([f.col("b")], [udaf(f.col("a"))])
   
   result = df.collect()
   ```
   
   is returning 4 record batches. I can't see a valid reason for a record batch 
of 3 slots to be split into 4 record batches.
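
To make the symptom easier to pin down, the following is a minimal sketch of a helper that summarizes what `df.collect()` returned. Column `b` holds two distinct values (4 and 6), so the aggregate should produce 2 rows in total. Four batches therefore implies that some are empty or hold a single row. `summarize_batches` and `FakeBatch` are hypothetical names introduced here for illustration. The helper relies only on the `num_rows` attribute that `pyarrow.RecordBatch` exposes, and the split shown is an assumed example, not observed output.

```python
from collections import namedtuple


def summarize_batches(batches):
    """Return (batch_count, total_rows, empty_batch_count) for a list of batches."""
    row_counts = [b.num_rows for b in batches]
    empty = sum(1 for n in row_counts if n == 0)
    return len(row_counts), sum(row_counts), empty


# Stand-in for pyarrow.RecordBatch so the sketch runs without pyarrow installed.
FakeBatch = namedtuple("FakeBatch", ["num_rows"])

# Hypothetical split of the 2 expected result rows across 4 batches (an assumption).
example = [FakeBatch(0), FakeBatch(1), FakeBatch(0), FakeBatch(1)]

print(summarize_batches(example))  # (4, 2, 2)
```

In the failing test, calling `summarize_batches(result)` on the collected batches would show whether the extra batches are empty or whether each group ends up in its own batch.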
   

