Baunsgaard commented on PR #1953:
URL: https://github.com/apache/systemds/pull/1953#issuecomment-1864745007

   @mboehm7 
   
   I finally found out what the issue is and propose a solution.
   
   The problem is running tests with multiple federated workers in the same 
JVM. The tests fail when multiple workers read the same input matrix X in 
their individual partitions. With the latest changes and speedups from 
different sources, this makes the buffer pool fail or time out (or at least 
that is my hypothesis).
   
   The fix is to run the workers in separate processes. This is the intended 
use case for federated workers anyway, so the change is safe to make. I will 
change all the federated tests later today.
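   
   As a rough illustration of the idea (not the actual test-utility change), 
a test helper could start each worker as its own child JVM through 
`ProcessBuilder`. The class and method names below are hypothetical, and the 
sketch assumes the worker is started through the `DMLScript` entry point with 
`-w <port>`; the classpath is a placeholder supplied by the caller.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs each federated worker in its own JVM so that
// workers no longer share one buffer pool inside the test JVM.
class FederatedWorkerLauncher {
    // Assumption: the worker is started via this entry point with "-w <port>".
    static final String MAIN_CLASS = "org.apache.sysds.api.DMLScript";

    // Builds the command line for a child JVM that reuses the current Java installation.
    static List<String> buildCommand(String classpath, int port) {
        String javaBin = System.getProperty("java.home")
            + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(MAIN_CLASS);
        cmd.add("-w");
        cmd.add(Integer.toString(port));
        return cmd;
    }

    // Starts one worker process; its stdout/stderr are forwarded to the test output.
    static Process startWorker(String classpath, int port) throws IOException {
        return new ProcessBuilder(buildCommand(classpath, port))
            .inheritIO()
            .start();
    }

    // Asks the worker to terminate, then kills it if it has not exited within five seconds.
    static void stopWorker(Process worker) throws InterruptedException {
        worker.destroy();
        if (!worker.waitFor(5, TimeUnit.SECONDS)) {
            worker.destroyForcibly();
        }
    }
}
```

   A test would call `startWorker` once per port before running the federated 
script and `stopWorker` for each returned `Process` in its teardown.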
   
   Future work: We might want to consider testing multiprocessing congestion on 
our buffer pool and internal memory management for future usage of parallel 
execution of instructions.

