Github user MrBago commented on the issue:
https://github.com/apache/spark/pull/19904
@BryanCutler Thanks for the response; I think I understand the situation a
little better now.
I think there is a fundamental tradeoff here that we cannot avoid: distributed
memory usage versus driver memory usage. A model cannot be garbage collected
until its evaluation future has finished running, and the training data cannot
be unpersisted until all of the training tasks have finished.
Let's say we have 3 models to train and evaluate with `parallelism=1`, and
let's call the training and evaluation tasks T1-T3 and E1-E3. If we schedule
the tasks T1, T2, T3, unpersist, E1, E2, E3, then we can use less distributed
memory, but we cannot avoid holding all 3 models in driver memory at once. If
we schedule the tasks T1, E1, T2, E2, T3, unpersist, E3, then we must use more
distributed memory, but we only ever need to hold 1 model in memory at a time
(both orderings are sketched below).
I'm not sure exactly how Scala futures work here; when T1 is done, I don't
know whether T2 or E1 gets priority in the executor pool. Can we guarantee
that the JVM will not schedule T1, T2, T3, E1, E2, E3, unpersist?
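
For what it's worth, here is a small self-contained sketch (placeholder
`train`/`evaluate` again, with a sleep standing in for real work) of why that
ordering seems possible: with a plain single-threaded FIFO pool, the main
thread submits T1, T2, T3 up front, and each Ei is only submitted once Ti
completes, so it lands behind the remaining training tasks.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object FifoOrderSketch {
  // Placeholders with prints so the actual execution order is visible;
  // the sleep stands in for nontrivial training work.
  def train(i: Int): Int = { println(s"T$i"); Thread.sleep(100); i }
  def evaluate(m: Int): Double = { println(s"E$m"); m.toDouble }

  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

  def main(args: Array[String]): Unit = {
    // The main thread enqueues T1, T2, T3 right away; each Ei is enqueued
    // only when Ti completes, so it lands behind the remaining training
    // tasks. In practice this prints T1, T2, T3, E1, E2, E3.
    val metrics = (1 to 3).map(i => Future(train(i)).map(evaluate))
    Await.result(Future.sequence(metrics), Duration.Inf)
  }
}
```

If that's right, getting the interleaved order seems to require either fusing
train + evaluate into one future body (as in schedule (b) above) or explicitly
sequencing T2 behind E1.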