[GitHub] [arrow-datafusion] jorgecarleitao opened a new issue #523: Number of output record batches for small datasets is large

GitBox Mon, 07 Jun 2021 22:03:32 -0700


jorgecarleitao opened a new issue #523:
URL: https://github.com/apache/arrow-datafusion/issues/523



   When running group bys on small datasets, we are emitting too many record 
batches. This is a regression relative to ref 2423ff0d.
   
   This is causing the tests for the Python bindings to fail when upgrading to 
321fda40.
   
   For example,
   
   ```
   import pyarrow

   # ctx is an existing DataFusion execution context
   batch = pyarrow.RecordBatch.from_arrays(
       [pyarrow.array([1, 2, 3]), pyarrow.array([4, 4, 6])],
       names=["a", "b"],
   )
   df = ctx.create_dataframe([[batch]])
   ```
   
   with 
   
   ```
   df = df.aggregate([f.col("b")], [udaf(f.col("a"))])
   
   result = df.collect()
   ```
   
   is returning 4 record batches. I can't see a valid reason for a record batch 
of 3 slots to be split into 4 record batches.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
