cisaacstern commented on issue #29365: URL: https://github.com/apache/beam/issues/29365#issuecomment-1847982978
As Brian discovered during our sync on this today, the Dask issues already exist:

- https://github.com/dask/distributed/issues/4141
- https://github.com/dask/dask/issues/6723

> Encoding python strings for dask shuffle is non-deterministic somehow.

This is indeed the root issue: Python's built-in `hash` is non-deterministic across processes for strings, which matches our observations here.

I just confirmed that using `cluster = LocalCluster(processes=False)` (which runs the workers as threads in a single process) resolves this groupby issue. Of course, that does not reflect how the DaskRunner will be used in the wild, so we will need a fix for the above Dask issue(s) to move forward here. I will ping the Dask issues now.
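For anyone who wants to see the underlying behavior without Dask or Beam, here is a minimal sketch using only the standard library. The helper names `hash_in_fresh_process` and `stable_hash` are made up for this illustration, and the `blake2b`-based hash is just one possible deterministic alternative, not what Dask does or is proposing to do.

```python
import hashlib
import os
import subprocess
import sys


def hash_in_fresh_process(value: str, seed: str = "random") -> int:
    """Return built-in hash(value) as computed by a brand-new interpreter.

    Hypothetical helper for illustration. `seed` is passed through as
    PYTHONHASHSEED; "random" requests a fresh random seed per process.
    """
    env = dict(os.environ)
    env["PYTHONHASHSEED"] = seed
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", value],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def stable_hash(value: str) -> int:
    """Process-independent 64-bit hash of a string (illustrative only)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


if __name__ == "__main__":
    # Each call starts a separate process, much like separate Dask workers.
    print("built-in hash, random seed:",
          [hash_in_fresh_process("key") for _ in range(3)])
    # Pinning PYTHONHASHSEED makes the built-in hash repeatable.
    print("built-in hash, seed=0:     ",
          [hash_in_fresh_process("key", seed="0") for _ in range(3)])
    # A content-based digest does not depend on the process at all.
    print("stable_hash:               ", stable_hash("key"))
```

With a random seed, the same string key generally hashes differently in each worker process, so a hash-partitioned shuffle can route equal keys to different partitions. That is consistent with the groupby results we saw, and with the issue going away when all workers share one process.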
