Dandandan commented on issue #523:
URL: https://github.com/apache/arrow-datafusion/issues/523#issuecomment-856452504


   > When running group bys on small datasets, we are emitting too many record batches. This is a regression over ref [2423ff0](https://github.com/apache/arrow-datafusion/commit/2423ff0dd1fe9c0932c1cb8d1776efa3acd69554).
   > 
   > This is causing the tests for the Python bindings to fail when upgrading to [321fda4](https://github.com/apache/arrow-datafusion/commit/321fda40a47bcc494c5d2511b6e8b02c9ea975b4).
   > 
   > For example,
   > 
   > ```
   > batch = pyarrow.RecordBatch.from_arrays(
   >     [pyarrow.array([1, 2, 3]), pyarrow.array([4, 4, 6])],
   >     names=["a", "b"],
   > )
   > return ctx.create_dataframe([[batch]])
   > ```
   > 
   > with
   > 
   > ```
   > df = df.aggregate([f.col("b")], [udaf(f.col("a"))])
   > 
   > result = df.collect()
   > ```
   > 
   > is returning 4 record batches. I can't see a valid reason for a record batch of 3 slots to be split into 4 record batches.
   
   I think that is due to the (hash) partitioning in group by / join, which splits the rows into multiple partitions/batches.
   Generally, I think the unit tests should not rely on the number of batches, or on the order of rows when no sort is applied, as DataFusion should be free to reorder however it wants.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

