[ https://issues.apache.org/jira/browse/SUBMARINE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Byron Hsu updated SUBMARINE-857:
--------------------------------
    Description: 
Submarine is a platform designed for distributed training, so its model 
management SDK should be easy to use in distributed scenarios.
In a typical distributed experiment, several workers train together.
Our model management toolkit will support:
1. Workers in the same experiment automatically direct their logs to the 
same group in MLflow, so users can monitor information from multiple workers 
in one graph.
2. When saving models, users do not need to store every worker's copy, because 
some are replicated or redundant. When users call save_model, the toolkit 
applies the most efficient saving strategy under the hood to minimize space 
and time costs.
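
To make the two points above concrete, here is a minimal sketch of how a 
worker could work out its own identity, a shared log group, and whether it 
should be the one to save the model. It is not the toolkit's actual API. All 
function names and the EXPERIMENT_ID variable are hypothetical. TF_CONFIG and 
RANK are assumed to be set by the TensorFlow and PyTorch launchers 
respectively.

```python
import json
import os


def worker_identity(env=None):
    """Return (job_type, index) for the current worker.

    Assumption: the launcher sets TF_CONFIG (TensorFlow) or RANK (PyTorch).
    Without either, the process is treated as a single worker with index 0.
    """
    env = os.environ if env is None else env
    if "TF_CONFIG" in env:
        task = json.loads(env["TF_CONFIG"]).get("task", {})
        return task.get("type", "worker"), int(task.get("index", 0))
    if "RANK" in env:
        return "worker", int(env["RANK"])
    return "worker", 0


def shared_group_name(env=None):
    """Return the log group that all workers of one experiment share.

    EXPERIMENT_ID is a hypothetical variable name used for illustration.
    """
    env = os.environ if env is None else env
    return env.get("EXPERIMENT_ID", "local-experiment")


def metric_key(name, env=None):
    """Prefix a metric name with the worker identity, e.g. 'worker-3/loss'."""
    job, index = worker_identity(env)
    return f"{job}-{index}/{name}"


def should_save_model(env=None):
    """Decide whether this worker should write the model.

    If the TensorFlow cluster declares a chief, only the chief saves.
    Otherwise only worker 0 saves, so replicated copies are not stored.
    """
    env = os.environ if env is None else env
    job, index = worker_identity(env)
    if "TF_CONFIG" in env:
        cluster = json.loads(env["TF_CONFIG"]).get("cluster", {})
        if "chief" in cluster:
            return job == "chief"
    return job == "worker" and index == 0
```

A key built by metric_key could be passed to mlflow.log_metric inside a run 
that every worker resolves from shared_group_name, and a save_model wrapper 
could return early when should_save_model is false. How the real toolkit 
groups runs and picks a saving strategy is defined in the design doc, not 
here.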

The API design doc can be viewed here: 
https://hackmd.io/I6frSeZIQDaKQYK4nGCR5w?both

> [Umbrella] Support model management SDK in distributed scenarios
> ----------------------------------------------------------------
>
>                 Key: SUBMARINE-857
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-857
>             Project: Apache Submarine
>          Issue Type: Task
>            Reporter: Byron Hsu
>            Priority: Major



