Dandandan commented on issue #523:
URL: https://github.com/apache/arrow-datafusion/issues/523#issuecomment-856452504


   > When running group bys on small datasets, we are emitting too many record batches. This is a regression over ref [2423ff0](https://github.com/apache/arrow-datafusion/commit/2423ff0dd1fe9c0932c1cb8d1776efa3acd69554).
   > 
   > This is causing the tests for the Python bindings to fail when upgrading to [321fda4](https://github.com/apache/arrow-datafusion/commit/321fda40a47bcc494c5d2511b6e8b02c9ea975b4).
   > 
   > For example,
   > 
   > ```
   > batch = pyarrow.RecordBatch.from_arrays(
   >     [pyarrow.array([1, 2, 3]), pyarrow.array([4, 4, 6])],
   >     names=["a", "b"],
   > )
   > return ctx.create_dataframe([[batch]])
   > ```
   > 
   > with
   > 
   > ```
   > df = df.aggregate([f.col("b")], [udaf(f.col("a"))])
   > 
   > result = df.collect()
   > ```
   > 
   > is returning 4 record batches. I can't see a valid reason for a record batch of 3 slots to be split into 4 record batches.
   
   I think that is due to the (hash) partitioning in group by / join, which splits the rows into multiple partitions/batches.
   Generally, I think the unit tests should not rely on the number of batches, or on the order of rows when no sort is applied, as DataFusion should be free to reorder however it wants.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

