yuanzac commented on a change in pull request #143: SUBMARINE-333. Docs of submarine server deployment
URL: https://github.com/apache/submarine/pull/143#discussion_r365754483
##########
File path: docs/design/submarine-server/jobspec.md
##########
@@ -0,0 +1,100 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Generic Job Spec
+
+## Motivation
+As the machine learning platform, the submarine should support multiple machine learning framework, such as Tensorflow, Pytorch etc. But different framework has different distributed components for the training job. So that we designed a generic job spec to abstract the training job across different frameworks. In this way, the submarine-server can hide the complexity of underlying infrastructure differences and provide a cleaner interface to manager jobs
+
+## Proposal
+Considering the Tensorflow and Pytorch framework, we proposal one spec which consists of library spec, submitter spec and task specs etc. Such as:
+```yaml
+name: "mnist"
+librarySpec:
+  name: "TensorFlow"
+  version: "2.1.0"
+  image: "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0"
+  cmd: "python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150"
+  envVars:
+    ENV_1: "ENV1"
+submitterSpec:
+  type: "k8s"
+  configPath:
+  namespace: "submarine"
+  kind: "TFJob"
+  apiVersion: "kubeflow.org/v1"
+taskSpecs:
+  Ps:
+    name: tensorflow
+    replicas: 2
+    resources: "cpu=4,memory=2048M,nvidia.com/gpu=1"
+  Worker:
+    name: tensorflow
+    replicas: 2
+    resources: "cpu=4,memory=2048M,nvidia.com/gpu=1"
+```
+
+### Library Spec
+The library spec describe the info about machine learning framework. All the fields as below:
+
+| field | type | optional | description |
+|---|---|---|---|
+| name | string | NO | Machine Learning Framework name. Such as: TensorFlow/PyTorch etc. |
+| version | string | NO | The version of ML framework. Such as: 2.1.0 |
+| image | string | NO | The public image used for each task if not specified. Such as: apache/submarine |
+| cmd | string | YES | The public entry cmd for the task if not specified. |
+| envVars | key/value | YES | The public env vars for the task if not specified. |
+
+### Submitter Spec
+It describe the info of submitter which the user spcified, such as yarn, yarnservice or k8s. All the fields as below:

Review comment:
describe to describes
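
For reference, the Library Spec table quoted in the diff could be modeled roughly as below. This is only a sketch and not the submarine-server implementation: `LibrarySpec`, `parse_library_spec` and `parse_resources` are hypothetical names, and reading the `resources` string as comma-separated `key=value` pairs is an assumption based on the example value in the quoted YAML.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Fields marked "optional: NO" in the Library Spec table.
REQUIRED_FIELDS = ("name", "version", "image")


@dataclass
class LibrarySpec:
    """Hypothetical in-memory form of the librarySpec block."""
    name: str
    version: str
    image: str
    cmd: Optional[str] = None
    env_vars: Dict[str, str] = field(default_factory=dict)


def parse_library_spec(raw: dict) -> LibrarySpec:
    """Build a LibrarySpec from a plain dict, rejecting missing required fields."""
    missing = [key for key in REQUIRED_FIELDS if not raw.get(key)]
    if missing:
        raise ValueError(f"librarySpec is missing required fields: {missing}")
    return LibrarySpec(
        name=raw["name"],
        version=raw["version"],
        image=raw["image"],
        cmd=raw.get("cmd"),
        env_vars=dict(raw.get("envVars") or {}),
    )


def parse_resources(text: str) -> Dict[str, str]:
    """Split a string such as "cpu=4,memory=2048M" into a key/value dict.

    Assumed format: comma-separated key=value pairs, as in the example spec.
    """
    pairs = (item.split("=", 1) for item in text.split(",") if item)
    return {key.strip(): value.strip() for key, value in pairs}
```

With the values from the quoted example, `parse_resources("cpu=4,memory=2048M,nvidia.com/gpu=1")` yields one entry each for `cpu`, `memory` and `nvidia.com/gpu`, while a `librarySpec` dict that lacks `image` is rejected with a `ValueError`.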
