Github user MrBago commented on the issue:

    https://github.com/apache/spark/pull/19904
  
    @BryanCutler Thanks for the response; I think I understand the situation here a little better now.
    
    I think there is a fundamental tradeoff we cannot avoid: distributed memory usage versus driver memory usage. A model cannot be garbage collected until its evaluation future has finished running, and the training data cannot be unpersisted until all of the training tasks have finished.
    
    Let's say we have 3 models to train and evaluate with `parallelism=1`, and let's call the training & evaluation tasks T1-T3 and E1-E3. If we schedule the tasks as T1, T2, T3, unpersist, E1, E2, E3, then we use less distributed memory but we cannot avoid holding all 3 models in driver memory at once. If we schedule them as T1, E1, T2, E2, T3, unpersist, E3, then we must use more distributed memory but we only ever need to hold 1 model in memory at a time.
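
    To make the second ordering concrete, here is a minimal standalone sketch (not the code in this PR) that chains each evaluation onto its training with `flatMap`, so at most one model is reachable at a time; `train` and `evaluate` are hypothetical stand-ins for fitting and scoring:

    ```scala
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    object OneModelAtATime {
      // Hypothetical stand-ins for fitting one model and scoring it.
      def train(i: Int): String = { println(s"T$i"); s"model$i" }
      def evaluate(model: String): Double = { println(s"E${model.last}"); 0.0 }

      def main(args: Array[String]): Unit = {
        // flatMap-sequencing: each train/evaluate pair completes before the
        // next training starts, so at most one model is alive on the driver.
        val done = (1 to 3).foldLeft(Future.successful(())) { (prev, i) =>
          prev.flatMap { _ => Future { evaluate(train(i)); () } }
        }
        Await.result(done, Duration.Inf) // prints T1, E1, T2, E2, T3, E3
      }
    }
    ```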
    
    I'm not sure how Scala futures get scheduled here: when T1 is done, I don't know whether T2 or E1 gets priority in the executor pool. Can we guarantee that the JVM will not schedule T1, T2, T3, E1, E2, E3, unpersist?
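
    To make that concern concrete, here is a minimal standalone sketch (again, not this PR's code) using a single-threaded FIFO pool as a stand-in for `parallelism=1`: if all the training futures are created up front, each evaluation callback is only enqueued when its training finishes, so it can land behind the trainings that are already queued.

    ```scala
    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    object SchedulingOrderSketch {
      def main(args: Array[String]): Unit = {
        // Single-threaded pool with a FIFO work queue, standing in for parallelism = 1.
        val pool = Executors.newFixedThreadPool(1)
        implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

        // All training futures are created eagerly, so T1-T3 enter the queue
        // before any evaluation callback is registered.
        val runs = (1 to 3).map { i =>
          Future { println(s"T$i") }        // training task
            .map { _ => println(s"E$i") }   // evaluation chained onto it
        }
        Await.result(Future.sequence(runs), Duration.Inf)
        // Typically prints T1, T2, T3, E1, E2, E3: each E callback is enqueued
        // only when its training finishes, behind the trainings already queued.
        pool.shutdown()
      }
    }
    ```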

