Re: [I] Max batch size for Dataset [arrow]

via GitHub Tue, 19 Mar 2024 03:10:16 -0700


NikolayKosarevO9 commented on issue #40576:
URL: https://github.com/apache/arrow/issues/40576#issuecomment-2006639406


   @amoeba I've noticed this magic number (1048576, i.e. 2^20) appearing in 
multiple places within the source code, but I'm not sure which occurrence 
actually causes the issue. When I set the batch size to less than 1048576, it 
is honored; however, any larger value is capped at 1048576.
   
   In my use case, I'm dealing (locally, on a single big-ass machine) with a 
large number of huge Parquet files as input, each containing over 1 billion 
records, and these files need to be re-partitioned based on the values in 
certain columns. There can be as many as 5,000 partitions. Currently, because 
the data is processed in batches of about 1 million rows (due to the cap), 
each partition ends up with a very large number of files, and each file is 
tiny. As a result, overall performance drops below acceptable levels.
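
   To put rough numbers on this, here is a back-of-the-envelope sketch in 
plain Python (no Arrow involved). It rests on two simplifying assumptions that 
are mine, not measured behaviour: rows are spread evenly across partitions, 
and every batch ends up as its own file in every partition. The function name 
`estimate_output_files` and the "64x larger batch" comparison are hypothetical 
and only there for illustration.

```python
def estimate_output_files(total_rows, batch_rows, num_partitions):
    """Rough estimate of the output layout for one input file.

    Assumptions (illustrative only): rows are distributed evenly over
    partitions, and each batch produces one file in every partition.
    """
    # Integer ceiling division: number of batches needed to cover all rows.
    num_batches = -(-total_rows // batch_rows)
    return {
        "files_per_partition": num_batches,
        "rows_per_file": batch_rows / num_partitions,
        "total_files": num_batches * num_partitions,
    }


# One input file with 1 billion rows, 5,000 partitions, batches capped at 2^20.
capped = estimate_output_files(1_000_000_000, 1_048_576, 5_000)
print(capped["files_per_partition"])  # 954 files in each partition
print(capped["total_files"])          # 4770000 files overall

# Hypothetical comparison: the same input if a 64x larger batch were allowed.
larger = estimate_output_files(1_000_000_000, 64 * 1_048_576, 5_000)
print(larger["files_per_partition"], larger["rows_per_file"])
```

   Under those assumptions, the capped batch size gives roughly 210 rows per 
file, which is the "many tiny files" pattern described above. Real numbers 
will differ with skewed partition keys or if the writer merges batches.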


