[ https://issues.apache.org/jira/browse/SUBMARINE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
cdmikechen updated SUBMARINE-857:
---------------------------------
    Target Version: 0.9.0

> [Umbrella] Support model management SDK in distributed scenarios
> ----------------------------------------------------------------
>
>                 Key: SUBMARINE-857
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-857
>             Project: Apache Submarine
>          Issue Type: Task
>            Reporter: Byron Hsu
>            Assignee: Byron Hsu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.0
>
>
> Submarine is a platform designed for distributed training, so its model
> management SDK should be easy to use in distributed scenarios.
> In a typical distributed experiment, several workers train together.
> Our model management toolkit will support:
> 1. Workers in the same experiment automatically direct their logs to the
> same group in MLflow, so users can monitor multiple workers' info in one
> graph.
> 2. When saving models, users do not need to store every worker's copy,
> because some are replicated or redundant. When users call save_model in our
> toolkit, it applies the most efficient saving strategy under the hood,
> minimizing the space and time required.
> The API design doc can be viewed here:
> [https://hackmd.io/I6frSeZIQDaKQYK4nGCR5w?both]
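
The two behaviours described in the issue (one shared log group per experiment, and skipping redundant model copies on save) can be illustrated with a minimal sketch. Everything here is hypothetical: `Worker`, `log_group`, and `should_save` are names invented for illustration and are not part of the Submarine SDK or MLflow, and the "worker 0 saves the replicated copy" rule is an assumed strategy, not necessarily the one the toolkit uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    """Hypothetical description of one worker in a distributed experiment."""
    experiment_id: str
    index: int            # position of the worker within the experiment
    holds_replica: bool   # True if its weights duplicate another worker's


def log_group(worker: Worker) -> str:
    """Every worker of the same experiment maps to one shared log group."""
    return f"experiment/{worker.experiment_id}"


def should_save(worker: Worker) -> bool:
    """Assumed strategy: unique shards are always saved; when the weights
    are replicated across workers, only worker 0 writes them."""
    return (not worker.holds_replica) or worker.index == 0
```

With three data-parallel workers that all hold the same weights, only the first would write the model, while all three would report to the same group; workers holding distinct shards would each save their own part.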