jacobtomlinson commented on issue #26669:
URL: https://github.com/apache/beam/issues/26669#issuecomment-1579019317

   Quick update. The latest Dask release (`2023.5.1`) includes the fix to Dask 
Bag's partitioning heuristic, which means the number of partitions now 
continues to increase as more items are added to the bag. This partially works 
around the problem we are seeing here: in theory, with enough items in the 
PCollection, you can saturate any number of Dask workers. In practice, however, 
your dataset may not have enough items, so this may not solve the problem for 
everyone.
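   As a rough illustration (assuming a Dask release with the updated heuristic, 
i.e. `2023.5.1` or later), the partition count chosen by `from_sequence` grows 
with the input size rather than capping early:

```python
import dask.bag as db

# With the updated heuristic, from_sequence picks more partitions
# as the input grows, so larger inputs can spread across more workers.
small = db.from_sequence(list(range(10)))
big = db.from_sequence(list(range(100_000)))

# The larger bag should end up with at least as many partitions.
print(small.npartitions, big.npartitions)
```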
   
   A complete fix would be for Beam to explicitly set the number of partitions 
in the bag equal to the number of workers in the cluster. Beam knows this 
information at execution time, so it should be a matter of passing it along. I 
started looking into proposing a fix, but it seems the bag is created lazily, 
at a point where we don't yet have a Dask Distributed Client. I'm at the point 
where I need someone more familiar with Beam, like @alxmrs, to point me in the 
right direction of what to do next.
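   A minimal sketch of the idea, assuming a client is available when the bag is 
built (`bag_from_pcollection` is a hypothetical helper for illustration, not a 
real Beam API; Beam's DaskRunner internals may differ):

```python
import dask.bag as db

def bag_from_pcollection(items, client=None):
    """Build a bag with one partition per Dask worker when a client is known.

    Hypothetical helper: illustrates sizing partitions to the cluster.
    """
    if client is not None:
        # Number of currently connected workers, known only at execution time.
        npartitions = max(1, len(client.scheduler_info()["workers"]))
    else:
        # No client yet -- fall back to Dask's own heuristic, which is
        # exactly the lazy-creation situation described above.
        npartitions = None
    return db.from_sequence(list(items), npartitions=npartitions)

# Without a client, Dask's default heuristic decides the partitioning.
bag = bag_from_pcollection(range(8))
```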
   
   cc @rabernat who may have a passing interest in this issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to