Baunsgaard commented on PR #1953: URL: https://github.com/apache/systemds/pull/1953#issuecomment-1864745007
@mboehm7 I finally found out what the issue is and propose a solution. The problem is running tests with multiple federated workers in the same JVM. Specifically, we fail in cases where multiple workers read the same input matrix X in their individual partitions. This makes the buffer pool fail or time out, given the latest changes and speedups from different sources (or at least that is my hypothesis).

The fix is to run the workers in separate processes. That is the intended use case for federated workers anyway, so making the change is fine. I will change all the federated tests later today.

Future work: We might want to consider testing multiprocessing congestion on our buffer pool and internal memory management, with a view to future parallel execution of instructions.
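A minimal sketch of why co-located workers can interfere, assuming the buffer pool is JVM-global (static) state. `ToyBufferPool`, `startWorkerThread`, and `runInSameJvm` are hypothetical illustrations and not SystemDS classes. Workers started as threads in one JVM all pin blocks in the same static pool, so their load adds up. One worker per process would give each worker its own pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedJvmStateDemo {

    // Hypothetical stand-in for JVM-global state such as a buffer pool:
    // a static field exists once per JVM (per class loader), not once per worker.
    static final class ToyBufferPool {
        static final AtomicInteger pinnedBlocks = new AtomicInteger();

        static void pin() {
            pinnedBlocks.incrementAndGet();
        }
    }

    // Hypothetical worker: "reads" its partition of the shared input X by
    // pinning blocksToRead blocks in the single, JVM-wide pool.
    static Thread startWorkerThread(int blocksToRead) {
        Thread t = new Thread(() -> {
            for (int i = 0; i < blocksToRead; i++) {
                ToyBufferPool.pin();
            }
        });
        t.start();
        return t;
    }

    // Runs numWorkers in-JVM workers and returns how many blocks the one
    // shared pool ends up holding once all workers have finished.
    static int runInSameJvm(int numWorkers, int blocksPerWorker) throws InterruptedException {
        ToyBufferPool.pinnedBlocks.set(0);
        List<Thread> workers = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            workers.add(startWorkerThread(blocksPerWorker));
        }
        for (Thread t : workers) {
            t.join(); // join() makes each worker's updates visible here
        }
        return ToyBufferPool.pinnedBlocks.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 4 in-JVM workers x 10 blocks each all land in one pool.
        System.out.println(runInSameJvm(4, 10));
    }
}
```

With four in-JVM workers the single pool holds 40 blocks. Separate worker processes would each hold only their own 10, which matches how federated workers run in real deployments.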