jacobtomlinson commented on issue #26669: URL: https://github.com/apache/beam/issues/26669#issuecomment-1579019317
Quick update. The latest Dask release (`2023.5.1`) includes the fix to Dask Bag's partitioning heuristic, which means the number of partitions now continues to increase as more items are added to the bag. This partially works around the problem we are seeing here: in theory, with enough items in the PCollection you can saturate any number of Dask workers. In practice, though, your dataset may not contain enough items, so this may not solve the problem for everyone.

A complete fix would be for Beam to explicitly set the number of partitions in the bag equal to the number of workers in the cluster (see the sketch below). Beam knows this information at execution time, so it should be a matter of passing it along. I started looking into proposing a fix, but it seems the bag gets created lazily, at a point where we don't yet have a Dask Distributed `Client`. I think I'm at the point where I need someone more familiar with Beam, like @alxmrs, to point me in the right direction on what to do next.

cc @rabernat, who may have a passing interest in this issue
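
For concreteness, here's a minimal sketch of what the complete fix could look like *if* a `dask.distributed` `Client` were available at bag-creation time (which, per the above, is not currently the case in Beam's DaskRunner). The `items` sequence here is a hypothetical stand-in for a PCollection's elements, not anything from Beam's actual code:

```python
import dask.bag as db
from dask.distributed import Client

client = Client()  # connect to (or start) a Dask cluster

items = range(1000)  # hypothetical stand-in for a PCollection's elements

# Size the bag's partition count to the number of workers so that
# every worker gets at least one partition to process.
n_workers = len(client.scheduler_info()["workers"])
bag = db.from_sequence(items, npartitions=n_workers)
```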
