jacobtomlinson commented on issue #26669: URL: https://github.com/apache/beam/issues/26669#issuecomment-1552877898
After exploring this further there seem to be two separate things going on here.

First, Dask Bag limits the number of partitions it creates by default. The limit is 100, but due to some integer rounding the partition count overshoots 100 until the number of items reaches a multiple of 100, then collapses back down. This is why we see a ~50% performance drop between 199 items and 200 items, then a small increase before dropping again at 300, and so on. I've opened https://github.com/dask/dask/pull/10294 to address this so that the number of tasks continues to grow with the number of items, just in a non-linear way. There is a sketch of the default heuristic below.

Second, Beam is always using a Dask Distributed cluster, and at execution time it holds an instance of `dask.distributed.Client`. Therefore we could override the default partitioning scheme altogether and explicitly set the number of partitions to match the number of workers in the cluster (second sketch below). This would require Beam to somehow communicate this information to the `Create` node when it walks the pipeline.
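To illustrate the first point, here is a rough sketch of how the default sizing behaves in `dask.bag.from_sequence` when neither `npartitions` nor `partition_size` is passed. The `approx_default_npartitions` helper is my own paraphrase of that heuristic on Dask releases prior to the PR above, not the actual implementation:

```python
import math

# Paraphrase (an approximation, not the exact Dask source) of the default
# sizing heuristic in dask.bag.from_sequence: one item per partition below
# 100 items, otherwise partition_size = len(seq) // 100.
def approx_default_npartitions(n_items: int) -> int:
    partition_size = max(1, n_items // 100)
    return math.ceil(n_items / partition_size)

for n in (99, 100, 199, 200, 250, 299, 300):
    print(f"{n:>4} items -> ~{approx_default_npartitions(n):>3} partitions")
```

The output shows the pattern described above: 199 items give ~199 partitions, 200 items collapse to 100, the count then climbs again (e.g. ~125 at 250) before dropping back to 100 at 300.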

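For the second point, a minimal sketch of the kind of override this would need, assuming the runner's existing client is available at that point. The local `Client(processes=False)` and the `elements` list here are placeholders standing in for the runner's real client and the values behind a `Create` node:

```python
import dask.bag as db
from dask.distributed import Client

# Placeholder: the Dask runner already holds a Client at execution time.
# Here we start a local in-process cluster just to make the example runnable.
client = Client(processes=False)

# Size the bag to the cluster rather than relying on the default heuristic.
n_workers = len(client.scheduler_info()["workers"])

elements = list(range(1_000))  # placeholder for the Create transform's values
bag = db.from_sequence(elements, npartitions=max(1, n_workers))
print(f"{n_workers} workers -> {bag.npartitions} partitions")
```

The point being that once the bag is partitioned per worker, throughput no longer swings with the item count the way the default heuristic does.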