[
https://issues.apache.org/jira/browse/SPARK-41342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lu Wang updated SPARK-41342:
----------------------------
Description:
There is a clear trend in deep learning from single-machine to distributed
training, both to scale and to accelerate it. Adding support for distributed DL
on Spark will make Spark more capable and greatly simplify distributed DL
workloads for users.
Currently,
[spark-tensorflow-distributor|https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor]
provides a way to run distributed TensorFlow on Spark clusters, but there is
no equivalent support for distributed PyTorch.
We want to add a general framework that supports both DL frameworks, so that
users have a unified interface for distributed DL workloads on Spark. It can
also take advantage of GPU scheduling on Spark and provide better resource
management.
> Add support for distributed deep learning framework
> ---------------------------------------------------
>
> Key: SPARK-41342
> URL: https://issues.apache.org/jira/browse/SPARK-41342
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.2
> Reporter: Lu Wang
> Priority: Major
>
> There is a clear trend in deep learning from single-machine to distributed
> training, both to scale and to accelerate it. Adding support for distributed
> DL on Spark will make Spark more capable and greatly simplify distributed DL
> workloads for users.
> Currently,
> [spark-tensorflow-distributor|https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor]
> provides a way to run distributed TensorFlow on Spark clusters, but there is
> no equivalent support for distributed PyTorch.
> We want to add a general framework that supports both DL frameworks, so that
> users have a unified interface for distributed DL workloads on Spark. It can
> also take advantage of GPU scheduling on Spark and provide better resource
> management.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)