[ 
https://issues.apache.org/jira/browse/SUBMARINE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cdmikechen updated SUBMARINE-857:
---------------------------------
    Target Version: 0.9.0

> [Umbrella] Support model management SDK in distributed scenarios
> ----------------------------------------------------------------
>
>                 Key: SUBMARINE-857
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-857
>             Project: Apache Submarine
>          Issue Type: Task
>            Reporter: Byron Hsu
>            Assignee: Byron Hsu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.0
>
>
> Submarine is a platform designed for distributed training, so its model 
> management SDK should be easy to use in distributed scenarios.
>  In a typical distributed experiment, several workers train 
> together.
>  Our model management toolkit will support:
>  1. Workers in the same experiment automatically direct their logs 
> to the same group in MLflow, so users can monitor multiple workers' info in 
> one graph.
>  2. When saving models, users do not need to store every worker's copy, because 
> some are replicated or redundant. When users call save_model, the toolkit 
> applies the most efficient saving strategy under the hood, keeping storage 
> space and saving time to a minimum.
> The API design doc can be viewed here: 
> [https://hackmd.io/I6frSeZIQDaKQYK4nGCR5w?both]
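
The two behaviours described in the issue can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Submarine SDK: the environment variable names EXPERIMENT_ID and RANK, the helpers worker_identity, log_group and should_save, the JSON file layout, and the "rank 0 saves replicated weights" rule are all hypothetical stand-ins. The issue itself does not say which saving strategy is used, and the real design lives in the linked API doc.

```python
import json
import os


def worker_identity(env=None):
    """Read the experiment id and worker rank from the environment.

    EXPERIMENT_ID and RANK are hypothetical variable names for this sketch.
    """
    env = os.environ if env is None else env
    return env.get("EXPERIMENT_ID", "local"), int(env.get("RANK", "0"))


def log_group(experiment_id):
    # Every worker of one experiment derives the same group key,
    # so their logs can be collected and plotted together.
    return f"experiment-{experiment_id}"


def should_save(rank, replicated=True):
    # Assumption: replicated (data-parallel) weights are identical on every
    # worker, so one copy written by rank 0 is enough. Sharded weights
    # differ per worker, so each worker writes its own part.
    return rank == 0 if replicated else True


def save_model(state, directory, env=None, replicated=True):
    """Write `state` as JSON if this worker is responsible for saving.

    Returns the written path, or None when the worker skips the save.
    """
    experiment_id, rank = worker_identity(env)
    if not should_save(rank, replicated):
        return None
    os.makedirs(directory, exist_ok=True)
    name = "model.json" if replicated else f"model-shard-{rank}.json"
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        json.dump({"group": log_group(experiment_id), "state": state}, f)
    return path
```

In this sketch every worker calls save_model with the same arguments, and the decision about who actually writes stays inside the helper, which is the "under the hood" behaviour the issue asks for.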



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@submarine.apache.org
For additional commands, e-mail: dev-h...@submarine.apache.org
