jacobtomlinson commented on issue #26669:
URL: https://github.com/apache/beam/issues/26669#issuecomment-1552877898

   After exploring this further, there seem to be two separate things going on here.
   
   The first is that Dask Bag has a default limit on the number of partitions it creates. That limit is 100, but due to some integer rounding the partition count overshoots until the item count reaches a multiple of 100 and then collapses back down. This is why we see a ~50% performance drop between 199 items and 200 items, then a small increase before it drops again at 300, and so on. I've opened https://github.com/dask/dask/pull/10294 to address this so that the number of tasks continues to grow with the number of items, just in a non-linear way.
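   For reference, a minimal sketch of how to observe this (the exact partition counts depend on your Dask version and will change once the linked PR is released):
   
   ```python
   import dask.bag as db
   
   # Inspect the default partition count around the 100-item boundaries.
   # With the current behaviour the count grows with the item count until
   # it hits a multiple of 100, then collapses back to 100.
   for n in (99, 100, 199, 200, 299, 300):
       bag = db.from_sequence(range(n))
       print(n, bag.npartitions)
   ```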
   
   The second is that Beam is always using a Dask Distributed cluster, and at the time of execution it has an instance of `dask.distributed.Client`. Therefore we could override the default partitioning scheme altogether and explicitly set the number of partitions to match the number of workers in the cluster. This would require Beam to somehow communicate this information to the `Create` node when it walks the pipeline.
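   As a rough illustration of that idea, and not an existing Beam API, something along these lines could be done wherever the `Client` is in scope (`create_bag` and its wiring are hypothetical; the real change would live in the DaskRunner's translation of `Create`):
   
   ```python
   import dask.bag as db
   from dask.distributed import Client
   
   def create_bag(values, client: Client) -> db.Bag:
       """Hypothetical helper: build a Bag with one partition per worker.
   
       This only sketches overriding Dask Bag's default partitioning with an
       explicit partition count taken from the cluster.
       """
       # scheduler_info() reports the workers currently known to the scheduler.
       n_workers = max(1, len(client.scheduler_info()["workers"]))
       return db.from_sequence(list(values), npartitions=n_workers)
   
   if __name__ == "__main__":
       client = Client()  # local cluster, just for demonstration
       bag = create_bag(range(1000), client)
       print(bag.npartitions)
   ```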

