This is an automated email from the ASF dual-hosted git repository.
pingsutw pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/submarine.git
The following commit(s) were added to refs/heads/master by this push:
new 8eb3e20 SUBMARINE-1118. Remove relevant yarn pages in the documentation
8eb3e20 is described below
commit 8eb3e20a034ffb86aeb4328a46b22cb621ea0a90
Author: woodcutter-eric <[email protected]>
AuthorDate: Sun Dec 5 17:39:36 2021 +0800
SUBMARINE-1118. Remove relevant yarn pages in the documentation
### What is this PR for?
Remove the outdated YARN documentation.
### What type of PR is it?
[ Documentation ]
### What is the Jira issue?
https://issues.apache.org/jira/projects/SUBMARINE/issues/SUBMARINE-1118?filter=myopenissues
### How should this be tested?
### Screenshots (if appropriate)
### Questions:
* Do the license files need updating? No
* Are there breaking changes for older versions? No
* Does this need new documentation? No
Author: woodcutter-eric <[email protected]>
Signed-off-by: Kevin <[email protected]>
Closes #818 from woodcutter-eric/SUBMARINE-1118 and squashes the following commits:
c2156583 [woodcutter-eric] SUBMARINE-1118. Remove some yarn description in the documentation
---
website/docs/adminDocs/yarn/README.md | 2 +-
.../designDocs/architecture-and-requirements.md | 106 ++++++++++-----------
.../docs/designDocs/experiment-implementation.md | 100 +++++++++----------
website/docs/designDocs/implementation-notes.md | 6 +-
.../designDocs/submarine-server/architecture.md | 56 +++++------
.../designDocs/submarine-server/experimentSpec.md | 6 +-
.../designDocs/wip-designs/submarine-launcher.md | 39 ++++----
website/docs/devDocs/README.md | 4 +-
8 files changed, 158 insertions(+), 161 deletions(-)
diff --git a/website/docs/adminDocs/yarn/README.md
b/website/docs/adminDocs/yarn/README.md
index 50fb134..cb5932c 100644
--- a/website/docs/adminDocs/yarn/README.md
+++ b/website/docs/adminDocs/yarn/README.md
@@ -1,5 +1,5 @@
---
-title: Running Submarine on YARN
+title: Running Submarine on YARN (deprecated)
---
<!--
diff --git a/website/docs/designDocs/architecture-and-requirements.md
b/website/docs/designDocs/architecture-and-requirements.md
index 3aac849..7041142 100644
--- a/website/docs/designDocs/architecture-and-requirements.md
+++ b/website/docs/designDocs/architecture-and-requirements.md
@@ -26,49 +26,49 @@ title: Architecture and Requirment
| Admin | Also called SRE, who manages user's quotas, credentials, team, and
other components. |
-## Background
+## Background
-Everybody talks about machine learning today, and lots of companies are trying
to leverage machine learning to push the business to the next level. Nowadays,
as more and more developers, infrastructure software companies coming to this
field, machine learning becomes more and more achievable.
+Everybody talks about machine learning today, and lots of companies are trying
to leverage machine learning to push the business to the next level. Nowadays,
as more and more developers, infrastructure software companies coming to this
field, machine learning becomes more and more achievable.
-In the last decade, the software industry has built many open source tools for
machine learning to solve the pain points:
+In the last decade, the software industry has built many open source tools for
machine learning to solve the pain points:
1. It was not easy to build machine learning algorithms manually, such as
logistic regression, GBDT, and many other algorithms:
- **Answer to that:** Industries have open sourced many algorithm libraries,
tools, and even pre-trained models so that data scientists can directly reuse
these building blocks to hook up to their data without knowing intricate
details inside these algorithms and models.
+ **Answer to that:** Industries have open sourced many algorithm libraries,
tools, and even pre-trained models so that data scientists can directly reuse
these building blocks to hook up to their data without knowing intricate
details inside these algorithms and models.
-2. It was not easy to achieve "WYSIWYG, what you see is what you get" from
IDEs: not easy to get output, visualization, troubleshooting experiences at the
same place.
+2. It was not easy to achieve "WYSIWYG, what you see is what you get" from
IDEs: not easy to get output, visualization, troubleshooting experiences at the
same place.
**Answer to that:** Notebooks concept was added to this picture, notebook
brought the experiences of interactive coding, sharing, visualization,
debugging under the same user interface. There're popular open-source notebooks
like Apache Zeppelin/Jupyter.
-
-3. It was not easy to manage dependencies: ML applications can run on one
machine is hard to deploy on another machine because it has lots of libraries
dependencies.
- **Answer to that:** Containerization becomes popular and a standard to
packaging dependencies to make it easier to "build once, run anywhere".
+
+3. It was not easy to manage dependencies: ML applications can run on one
machine is hard to deploy on another machine because it has lots of libraries
dependencies.
+ **Answer to that:** Containerization becomes popular and a standard to
packaging dependencies to make it easier to "build once, run anywhere".
4. Fragmented tools, libraries were hard for ML engineers to learn.
Experiences learned in one company are not naturally migratable to another
company.
**Answer to that:** A few dominant open-source frameworks reduced the
overhead of learning too many different frameworks, concepts. Data-scientist
can learn a few libraries such as Tensorflow/PyTorch, and a few high-level
wrappers like Keras will be able to create your machine learning application
from other open-source building blocks.
5. Similarly, models built by one library (such as libsvm) were hard to be
integrated into machine learning pipeline since there's no standard format.
**Answer to that:** Industry has built successful open-source standard
machine learning frameworks such as Tensorflow/PyTorch/Keras so their format
can be easily shared across. And efforts to build an even more general model
format such as ONNX.
-
-6. It was hard to build a data pipeline that flows/transform data from a raw
data source to whatever required by ML applications.
+
+6. It was hard to build a data pipeline that flows/transform data from a raw
data source to whatever required by ML applications.
**Answer to that:** Open source big data industry plays an important role
in providing, simplify, unify processes and building blocks for data flows,
transformations, etc.
-The machine learning industry is moving on the right track to solve major
roadblocks. So what are the pain points now for companies which have machine
learning needs? What can we help here? To answer this question, let's look at
machine learning workflow first.
+The machine learning industry is moving on the right track to solve major
roadblocks. So what are the pain points now for companies which have machine
learning needs? What can we help here? To answer this question, let's look at
machine learning workflow first.
## Machine Learning Workflows & Pain points
```
1) From different data sources such as edge, clickstream, logs, etc.
- => Land to data lakes
-
-2) From data lake, data transformation:
- => Data transformations: Cleanup, remove invalid rows/columns,
+ => Land to data lakes
+
+2) From data lake, data transformation:
+ => Data transformations: Cleanup, remove invalid rows/columns,
select columns, sampling, split train/test
data-set, join table, etc.
=> Data prepared for training.
-
-3) From prepared data:
- => Training, model hyper-parameter tuning, cross-validation, etc.
- => Models saved to storage.
-
-4) From saved models:
+
+3) From prepared data:
+ => Training, model hyper-parameter tuning, cross-validation, etc.
+ => Models saved to storage.
+
+4) From saved models:
=> Model assurance, deployment, A/B testing, etc.
=> Model deployed for online serving or offline scoring.
```
@@ -77,15 +77,15 @@ Typically data scientists responsible for item 2)-4), 1)
typically handled by a
### Pain \#1 Complex workflow/steps from raw data to model, different tools
needed by different steps, hard to make changes to workflow, and not error-proof
-It is a complex workflow from raw data to usable models, after talking to many
different data scientists, we have learned that a typical procedure to train a
new model and push to production can take months to 1-2 years.
+It is a complex workflow from raw data to usable models, after talking to many
different data scientists, we have learned that a typical procedure to train a
new model and push to production can take months to 1-2 years.
-It is also a wide skill set required by this workflow. For example, data
transformation needs tools like Spark/Hive for large scale and tools like
Pandas for a small scale. And model training needs to be switched between
XGBoost, Tensorflow, Keras, PyTorch. Building a data pipeline requires Apache
Airflow or Oozie.
+It is also a wide skill set required by this workflow. For example, data
transformation needs tools like Spark/Hive for large scale and tools like
Pandas for a small scale. And model training needs to be switched between
XGBoost, Tensorflow, Keras, PyTorch. Building a data pipeline requires Apache
Airflow or Oozie.
Yes, there are great, standardized open-source tools built for many of such
purposes. But how about changes need to be made for a particular part of the
data pipeline? How about adding a few columns to the training data for
experiments? How about training models, and push models to validation, A/B
testing before rolling to production? All these steps need jumping between
different tools, UIs, and very hard to make changes, and it is not error-proof
during these procedures.
### Pain \#2 Dependencies of underlying resource management platform
-To make jobs/services required by a machine learning platform to be able to
run, we need an underlying resource management platform. There're some choices
of resource management platform, and they have distinct advantages and
disadvantages.
+To make jobs/services required by a machine learning platform to be able to
run, we need an underlying resource management platform. There're some choices
of resource management platform, and they have distinct advantages and
disadvantages.
For example, there're many machine learning platform built on top of K8s. It
is relatively easy to get a K8s from a cloud vendor, easy to orchestrate
machine learning required services/daemons run on K8s. However, K8s doesn't
offer good support jobs like Spark/Flink/Hive. So if your company has
Spark/Flink/Hive running on YARN, there're gaps and a significant amount of
work to move required jobs from YARN to K8s. Maintaining a separate K8s cluster
is also overhead to Hadoop-based data in [...]
@@ -95,7 +95,7 @@ Similarly, if your company's data pipelines are mostly built
on top of cloud res
In addition to the above pain, we do see Data Scientists are forced to learn
underlying platform knowledge to be able to build a real-world machine learning
workflow.
-For most of the data scientists we talked with, they're experts of ML
algorithms/libraries, feature engineering, etc. They're also most familiar with
Python, R, and some of them understand Spark, Hive, etc.
+For most of the data scientists we talked with, they're experts of ML
algorithms/libraries, feature engineering, etc. They're also most familiar with
Python, R, and some of them understand Spark, Hive, etc.
If they're asked to do interactions with lower-level components like
fine-tuning a Spark job's performance; or troubleshooting job failed to launch
because of resource constraints; or write a K8s/YARN job spec and mount
volumes, set networks properly. They will scratch their heads and typically
cannot perform these operations efficiently.
@@ -115,11 +115,11 @@ An abstraction layer/framework to help the developer to
boost ML pipeline develo
### A little bit history
-Initially, Submarine is built to solve problems of running deep learning jobs
like Tensorflow/PyTorch on Apache Hadoop YARN, allows admin to monitor launched
deep learning jobs, and manage generated models.
+Initially, Submarine is built to solve problems of running deep learning jobs
like Tensorflow/PyTorch on Apache Hadoop YARN, allows admin to monitor launched
deep learning jobs, and manage generated models.
It was part of YARN initially, and code resides under
`hadoop-yarn-applications`. Later, the community decided to convert it to be a
subproject within Hadoop (Sibling project of YARN, HDFS, etc.) because we want
to support other resource management platforms like K8s. And finally, we're
reconsidering Submarine's charter, and the Hadoop community voted that it is
the time to moved Submarine to a separate Apache TLP.
-### Why Submarine?
+### Why Submarine?
`ONE PLATFORM`
@@ -145,22 +145,22 @@ A running notebook instance is called notebook session
(or session for short).
### Experiment
-Experiments of Submarine is an offline task. It could be a shell command, a
Python command, a Spark job, a SQL query, or even a workflow.
+Experiments of Submarine is an offline task. It could be a shell command, a
Python command, a Spark job, a SQL query, or even a workflow.
The primary purposes of experiments under Submarine's context is to do
training tasks, offline scoring, etc. However, experiment can be generalized to
do other tasks as well.
-Major requirement of experiment:
+Major requirement of experiment:
1) Experiments can be submitted from UI/CLI/SDK.
2) Experiments can be monitored/managed from UI/CLI/SDK.
-3) Experiments should not bind to one resource management platform (K8s/YARN).
+3) Experiments should not bind to one resource management platform (K8s).
#### Type of experiments

-There're two types of experiments:
-`Adhoc experiments`: which includes a Python/R/notebook, or even an adhoc
Tensorflow/PyTorch task, etc.
+There're two types of experiments:
+`Adhoc experiments`: which includes a Python/R/notebook, or even an adhoc
Tensorflow/PyTorch task, etc.
`Predefined experiment library`: This is specialized experiments, which
including developed libraries such as CTR, BERT, etc. Users are only required
to specify a few parameters such as input, output, hyper parameters, etc.
Instead of worrying about where's training script/dependencies located.
@@ -169,15 +169,15 @@ There're two types of experiments:
Requirements:
- Allow run adhoc scripts.
-- Allow model engineer, data scientist to run Tensorflow/Pytorch programs on
YARN/K8s/Container-cloud.
-- Allow jobs easy access data/models in HDFS/s3, etc.
+- Allow model engineer, data scientist to run Tensorflow/Pytorch programs on
K8s/Container-cloud.
+- Allow jobs easy access data/models in HDFS/s3, etc.
- Support run distributed Tensorflow/Pytorch jobs with simple configs.
- Support run user-specified Docker images.
- Support specify GPU and other resources.
#### Predefined experiment library
-Here's an example of predefined experiment library to train deepfm model:
+Here's an example of predefined experiment library to train deepfm model:
```
{
@@ -205,20 +205,20 @@ Predefined experiment libraries can be shared across
users on the same platform,
We will also model AutoML, auto hyper-parameter tuning to predefined
experiment library.
-#### Pipeline
+#### Pipeline
Pipeline is a special kind of experiment:
-- A pipeline is a DAG of experiments.
+- A pipeline is a DAG of experiments.
- Can be also treated as a special kind of experiment.
- Users can submit/terminate a pipeline.
- Pipeline can be created/submitted via UI/API.
### Environment Profiles
-Environment profiles (or environment for short) defines a set of libraries and
when Docker is being used, a Docker image in order to run an experiment or a
notebook.
+Environment profiles (or environment for short) defines a set of libraries and
when Docker is being used, a Docker image in order to run an experiment or a
notebook.
-Docker or VM image (such as AMI: Amazon Machine Images) defines the base layer
of the environment.
+Docker or VM image (such as AMI: Amazon Machine Images) defines the base layer
of the environment.
On top of that, users can define a set of libraries (such as Python/R) to
install.
@@ -228,16 +228,16 @@ Environments can be added/listed/deleted/selected through
CLI/SDK.
### Model
-#### Model management
+#### Model management
- Model artifacts are generated by experiments or notebook.
-- A model consists of artifacts from one or multiple files.
+- A model consists of artifacts from one or multiple files.
- Users can choose to save, tag, version a produced model.
- Once The Model is saved, Users can do the online model serving or offline
scoring of the model.
#### Model serving
-After model saved, users can specify a serving script, a model and create a
web service to serve the model.
+After model saved, users can specify a serving script, a model and create a
web service to serve the model.
We call the web service to "endpoint". Users can manage (add/stop) model
serving endpoints via CLI/API/UI.
@@ -247,36 +247,36 @@ Submarine-SDK provides tracking/metrics APIs, which
allows developers to add tra
### Deployment
-Submarine Services (See architecture overview below) should be deployed easily
on-prem / on-cloud. Since there're more and more public cloud offering for
compute/storage management on cloud, we need to support deploy Submarine
compute-related workloads (such as notebook session, experiments, etc.) to
cloud-managed clusters.
+Submarine Services (See architecture overview below) should be deployed easily
on-prem / on-cloud. Since there're more and more public cloud offering for
compute/storage management on cloud, we need to support deploy Submarine
compute-related workloads (such as notebook session, experiments, etc.) to
cloud-managed clusters.
This also include Submarine may need to take input parameters from customers
and create/manage clusters if needed. It is also a common requirement to use
hybrid of on-prem/on-cloud clusters.
### Security / Access Control / User Management / Quota Management
-There're 4 kinds of objects need access-control:
+There're 4 kinds of objects need access-control:
- Assets belong to Submarine system, which includes notebook, experiments and
results, models, predefined experiment libraries, environment profiles.
-- Data security. (Who owns what data, and what data can be accessed by each
users).
+- Data security. (Who owns what data, and what data can be accessed by each
users).
- User credentials. (Such as LDAP).
- Other security, such as Git repo access, etc.
-For the data security / user credentials / other security, it will be
delegated to 3rd libraries such as Apache Ranger, IAM roles, etc.
+For the data security / user credentials / other security, it will be
delegated to 3rd libraries such as Apache Ranger, IAM roles, etc.
Assets belong to Submarine system will be handled by Submarine itself.
-Here're operations which Submarine admin can do for users / teams which can be
used to access Submarine's assets.
+Here're operations which Submarine admin can do for users / teams which can be
used to access Submarine's assets.
-**Operations for admins**
+**Operations for admins**
-- Admin uses "User Management System" to onboard new users, upload user
credentials, assign resource quotas, etc.
-- Admins can create new users, new teams, update user/team mappings. Or remove
users/teams.
+- Admin uses "User Management System" to onboard new users, upload user
credentials, assign resource quotas, etc.
+- Admins can create new users, new teams, update user/team mappings. Or remove
users/teams.
- Admin can set resource quotas (if different from system default),
permissions, upload/update necessary credentials (like Kerberos keytab) of a
user.
- A DE/DS can also be an admin if the DE/DS has admin access. (Like a
privileged user). This will be useful when a cluster is exclusively shared by a
user or only shared by a small team.
- `Resource Quota Management System` helps admin to manage resources quotas of
teams, organizations. Resources can be machine resources like CPU/Memory/Disk,
etc. It can also include non-machine resources like $$-based budgets.
-### Dataset
+### Dataset
-There's also need to tag dataset which will be used for training and shared
across the platform by different users.
+There's also need to tag dataset which will be used for training and shared
across the platform by different users.
Like mentioned above, access to the actual data will be handled by 3rd party
system like Apache Ranger / Hive Metastore which is out of the Submarine's
scope.
@@ -300,7 +300,7 @@ Like mentioned above, access to the actual data will be
handled by 3rd party sys
| |Experiment | |Compute Resource | |Other Management | |
| |Manager | | Manager | |Services | |
| +-----------------+ +-----------------+ +---------------------+ |
- | Spark, template YARN/K8s/Docker |
+ | Spark, template K8s/Docker |
| TF, PyTorch, pipeline |
| |
+ +-----------------+ +
diff --git a/website/docs/designDocs/experiment-implementation.md
b/website/docs/designDocs/experiment-implementation.md
index a87bb89..ea110da 100644
--- a/website/docs/designDocs/experiment-implementation.md
+++ b/website/docs/designDocs/experiment-implementation.md
@@ -20,7 +20,7 @@ title: Experiment Implementation
This document talks about implementation of experiment, flows and design
considerations.
-Experiment consists of following components, also interact with other
Submarine or 3rd-party components, showing below:
+Experiment consists of following components, also interact with other
Submarine or 3rd-party components, showing below:
```
@@ -44,18 +44,18 @@ Experiment consists of following components, also interact
with other Submarine
| (Launch Task with resources)
+
+---------------------------------+
- |Resource Manager (K8s/YARN/Cloud)|
+ |Resource Manager (K8s/Cloud)|
+---------------------------------+
```
-As showing in the above diagram, Submarine experiment consists of the
following items:
+As showing in the above diagram, Submarine experiment consists of the
following items:
-- On the left side, there're input data and run configs.
-- In the middle box, they're experiment tasks, it could be multiple tasks when
we run distributed training, pipeline, etc.
- - There're main runnable code, such as `train.py` for the training main
entry point.
- - The two boxes below: experiment dependencies and OS/Base libraries we
called `Submarine Environment Profile` or `Environment` for short. Which
defined what is the basic libraries to run the main experiment code.
- - Experiment tasks are launched by Resource Manager, such as K8s/YARN/Cloud
or just launched locally. There're resources constraints for each experiment
tasks. (e.g. how much memory, cores, GPU, disk etc. can be used by tasks).
-- On the right side, they're artifacts generated by experiments:
+- On the left side, there're input data and run configs.
+- In the middle box, they're experiment tasks, it could be multiple tasks when
we run distributed training, pipeline, etc.
+ - There're main runnable code, such as `train.py` for the training main
entry point.
+ - The two boxes below: experiment dependencies and OS/Base libraries we
called `Submarine Environment Profile` or `Environment` for short. Which
defined what is the basic libraries to run the main experiment code.
+ - Experiment tasks are launched by Resource Manager, such as K8s/Cloud or
just launched locally. There're resources constraints for each experiment
tasks. (e.g. how much memory, cores, GPU, disk etc. can be used by tasks).
+- On the right side, they're artifacts generated by experiments:
- Output artifacts: Which are main output of the experiment, it could be
model(s), or output data when we do batch prediction.
- Logs/Metrics for further troubleshooting or understanding of experiment's
quality.
@@ -63,7 +63,7 @@ For the rest of the design doc, we will talk about how we
handle environment, co
## API of Experiment
-This is not a full definition of experiment, for more details, please
reference to experiment API.
+This is not a full definition of experiment, for more details, please
reference to experiment API.
Here's just an example of experiment object which help developer to understand
what included in an experiment.
@@ -74,17 +74,17 @@ experiment:
environment: "team-default-ml-env"
code:
sync_mode: s3
- url: "s3://bucket/training-job.tar.gz"
- parameter: > python training.py --iteration 10
+ url: "s3://bucket/training-job.tar.gz"
+ parameter: > python training.py --iteration 10
--input=s3://bucket/input output=s3://bucket/output
- resource_constraint:
+ resource_constraint:
res="mem=20gb, vcore=3, gpu=2"
timeout: "30 mins"
```
-This defined a "script" experiment, which has a name "abc", the name can be
used to track the experiment. There's environment "team-default-ml-env" defined
to make sure dependencies of the job can be downloaded properly before
executing the job.
+This defined a "script" experiment, which has a name "abc", the name can be
used to track the experiment. There's environment "team-default-ml-env" defined
to make sure dependencies of the job can be downloaded properly before
executing the job.
-`code` defined where the experiment code will be downloaded, we will support a
couple of sync_mode like s3 (or abfs/hdfs), git, etc.
+`code` defined where the experiment code will be downloaded, we will support a
couple of sync_mode like s3 (or abfs/hdfs), git, etc.
Different types of experiments will have different specs, for example
distributed Tensorflow spec may look like:
@@ -92,18 +92,18 @@ Different types of experiments will have different specs,
for example distribute
experiment:
name: "abc-distributed-tf",
type: "distributed-tf",
- ps:
+ ps:
environment: "team-default-ml-cpu"
- resource_constraint:
+ resource_constraint:
res="mem=20gb, vcore=3, gpu=0"
- worker:
+ worker:
environment: "team-default-ml-gpu"
- resource_constraint:
+ resource_constraint:
res="mem=20gb, vcore=3, gpu=2"
code:
sync_mode: git
- url: "https://foo.com/training-job.git"
- parameter: > python /code/training-job/training.py --iteration 10
+ url: "https://foo.com/training-job.git"
+ parameter: > python /code/training-job/training.py --iteration 10
--input=s3://bucket/input output=s3://bucket/output
tensorboard: enabled
timeout: "30 mins"
@@ -134,7 +134,7 @@ To better understand experiment implementation, It will be
good to understand wh
Before submit the environment, you have to choose what environment to choose.
Environment defines dependencies, etc. of an experiment or a notebook. might
looks like below:
```
-conda_environment =
+conda_environment =
"""
name: conda-env
channels:
@@ -156,7 +156,7 @@ environment = create_environment {
}
```
-To better understand how environment works, please refer to
[environment-implementation](./environments-implementation.md).
+To better understand how environment works, please refer to
[environment-implementation](./environments-implementation.md).
### Create experiment, specify where's training code located, and parameters.
@@ -164,7 +164,7 @@ For ad-hoc experiment (code located at S3), assume
training code is part of the
```
experiment = create_experiment {
- Environment = environment,
+ Environment = environment,
ExperimentConfig = {
type = "adhoc",
localize_artifacts = [
@@ -184,7 +184,7 @@ It is possible we want to run a notebook file in offline
mode, to do that, here'
```
experiment = create_experiment {
- Environment = environment,
+ Environment = environment,
ExperimentConfig = {
type = "adhoc",
localize_artifacts = [
@@ -203,12 +203,12 @@ experiment.wait_for_finish(print_output=True)
```
experiment = create_experiment {
# Here you can use default environment of library
- Environment = environment,
+ Environment = environment,
ExperimentConfig = {
type = "template",
name = "abc",
- # A unique name of template
- template = "deepfm_ctr",
+ # A unique name of template
+ template = "deepfm_ctr",
# yaml file defined what is the parameters need to be specified.
parameter = {
Input: "S3://.../input",
@@ -238,7 +238,7 @@ There's a common misunderstanding about what is the
differences between running
| Run history (meta, logs, metrics) | Meta/logs/metrics can be traced from
experiment UI (or corresponding API) | No run history can be traced from
Submarine UI/API. Can view the current running paragraph's log/metrics, etc. |
| What to run? | Code from Docker image or shared storage
(like Tarball on S3, Github, etc.) | Local in the notebook's paragraph
|
-**Commonalities**
+**Commonalities**
| | Experiment & Notebook Session |
| ----------- | ------------------------------------------------- |
@@ -254,21 +254,21 @@ The experiment manager receives the experiment requests,
persisting the experime
### Compute Cluster Manager
-After experiment accepted by experiment manager, based on which cluster the
experiment intended to run (like mentioned in the previous sections, Submarine
supports to manage multiple compute clusters), compute cluster manager will
returns credentials to access the compute cluster. It will also be responsible
to create a new compute cluster if needed.
+After experiment accepted by experiment manager, based on which cluster the
experiment intended to run (like mentioned in the previous sections, Submarine
supports to manage multiple compute clusters), compute cluster manager will
returns credentials to access the compute cluster. It will also be responsible
to create a new compute cluster if needed.
-For most of the on-prem use cases, there's only one cluster involved, for such
cases, ComputeClusterManager returns credentials to access local cluster if
needed.
+For most of the on-prem use cases, there's only one cluster involved, for such
cases, ComputeClusterManager returns credentials to access local cluster if
needed.
### Experiment Submitter
-Experiment Submitter handles different kinds of experiments to run (e.g.
ad-hoc script, distributed TF, MPI, pre-defined templates, Pipeline, AutoML,
etc.). And such experiments can be managed by different resource management
systems (e.g. K8s, YARN, container cloud, etc.)
+Experiment Submitter handles different kinds of experiments to run (e.g.
ad-hoc script, distributed TF, MPI, pre-defined templates, Pipeline, AutoML,
etc.). And such experiments can be managed by different resource management
systems (e.g. K8s, container cloud, etc.)
-To meet the requirements to support variant kinds of experiments and resource
managers, we choose to use plug-in modules to support different submitters
(which requires jars to submarine-server’s classpath).
+To meet the requirements to support variant kinds of experiments and resource
managers, we choose to use plug-in modules to support different submitters
(which requires jars to submarine-server’s classpath).
To avoid jars and dependencies of plugins break the submarine-server, the
plug-ins manager, or both. To solve this issue, we can instantiate submitter
plug-ins using a classloader that is different from the system classloader.
#### Submitter Plug-ins
-Each plug-in uses a separate module under the server-submitter module. As the
default implements, we provide for YARN and K8s. For YARN cluster, we provide
the submitter-yarn and submitter-yarnservice plug-ins. The submitter-yarn
plug-in used the [TonY](https://github.com/linkedin/TonY) as the runtime to run
the training job, and the submitter-yarnservice plug-in direct use the [YARN
Service](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html)
w [...]
+Each plug-in uses a separate module under the server-submitter module. As the
default implements, we provide for K8s.
The submitter-k8s plug-in is used to submit the job to Kubernetes cluster and
use the
[operator](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) as
the runtime. The submitter-k8s plug-in implements the operation of CRD object
and provides the java interface. In the beginning, we use the
[tf-operator](https://github.com/kubeflow/tf-operator) for the TensorFlow.
@@ -305,7 +305,7 @@ The monitor tracks the experiment life cycle and records
the main events and key
| create a new one.| to submit |+---------------> |
| | Different kinds | Once job is |
| | of experiments | submitted, use |+----+
- | | to k8s/yarn, etc| monitor to get | |
+ | | to k8s, etc| monitor to get | |
| | | status updates | |
| | | | |
Monitor
| | | | |
Xperiment
@@ -325,11 +325,11 @@ TODO: add more details about template, environment, etc.
## Common modules of experiment/notebook-session/model-serving
-Experiment/notebook-session/model-serving share a lot of commonalities, all of
them are:
+Experiment/notebook-session/model-serving share a lot of commonalities, all of
them are:
-- Some workloads running on YARN/K8s.
-- Need persist meta data to DB.
-- Need monitor task/service running status from resource management system.
+- Some workloads running on K8s.
+- Need persist meta data to DB.
+- Need monitor task/service running status from resource management system.
We need to make their implementation are loose-coupled, but at the same time,
share some building blocks as much as possible (e.g. submit PodSpecs to K8s,
monitor status, get logs, etc.) to reduce duplications.
@@ -374,29 +374,29 @@ The template will be (in yaml format):
```yaml
# deepfm.ctr template
name: deepfm.ctr
-author:
+author:
description: >
This is a template to run CTR training using deepfm algorithm, by default it
runs
single node TF job, you can also overwrite training parameters to use
distributed
- training.
-
-parameters:
+ training.
+
+parameters:
- name: input.train_data
- required: true
+ required: true
description: >
- train data is expected in SVM format, and can be stored in HDFS/S3
+ train data is expected in SVM format, and can be stored in HDFS/S3
...
- name: training.batch_size
required: false
- default: 32
+ default: 32
description: This is batch size of training
```
-The batch format can be used in UI/API.
+The batch format can be used in UI/API.
### Handle Predefined-experiment-template from server side
-Please note that, the conversion of predefined-experiment-template will be
always handled by server. The invoke flow looks like:
+Please note that, the conversion of predefined-experiment-template will be
always handled by server. The invoke flow looks like:
```
@@ -431,9 +431,9 @@ Please note that, the conversion of
predefined-experiment-template will be alway
+----------------------------------------------------+
```
-Basically, from Client, it submitted template parameters to Submarine Server,
inside submarine server, it finds the corresponding template handler based on
the name. And the template handler converts input parameters to an actual
experiment, such as a distributed TF experiment. After that, it goes the
similar route to validate experiment spec, compute cluster manager, etc. to get
the experiment submitted and monitored.
+Basically, from Client, it submitted template parameters to Submarine Server,
inside submarine server, it finds the corresponding template handler based on
the name. And the template handler converts input parameters to an actual
experiment, such as a distributed TF experiment. After that, it goes the
similar route to validate experiment spec, compute cluster manager, etc. to get
the experiment submitted and monitored.
-Predefined-experiment-template is able to create any kind of experiment, it
could be a pipeline:
+Predefined-experiment-template is able to create any kind of experiment, it
could be a pipeline:
```
diff --git a/website/docs/designDocs/implementation-notes.md
b/website/docs/designDocs/implementation-notes.md
index 7ebb996..ca226c4 100644
--- a/website/docs/designDocs/implementation-notes.md
+++ b/website/docs/designDocs/implementation-notes.md
@@ -22,12 +22,12 @@ Before digging into details of implementations, you should
read [architecture-an
Here're sub topics of Submarine implementations:
- [Submarine Storage](./storage-implementation.md): How to store metadata,
logs, metrics, etc. of Submarine.
-- [Submarine Environment](./environments-implementation.md): How environments
created, managed, stored in Submarine.
+- [Submarine Environment](./environments-implementation.md): How environments
created, managed, stored in Submarine.
- [Submarine Experiment](./experiment-implementation.md): How experiments
managed, stored, and how the predefined experiment template works.
- [Submarine Notebook](./notebook-implementation.md): How experiments managed,
stored, and how the predefined experiment template works.
- [Submarine Server](./submarine-server/architecture.md): How Submarine server
is designed, architecture, implementation notes, etc.
-Working-in-progress designs, Below are designs which are working-in-progress,
we will move them to the upper section once design & review is finished:
+Working-in-progress designs, Below are designs which are working-in-progress,
we will move them to the upper section once design & review is finished:
- [Submarine HA Design](./wip-designs/submarine-clusterServer.md): How
Submarine HA can be achieved, using RAFT, etc.
-- [Submarine services deployment module:](./wip-designs/submarine-launcher.md)
How to deploy submarine services to k8s, YARN or cloud.
+- [Submarine services deployment module:](./wip-designs/submarine-launcher.md)
How to deploy submarine services to k8s or cloud.
diff --git a/website/docs/designDocs/submarine-server/architecture.md
b/website/docs/designDocs/submarine-server/architecture.md
index 5b572c1..4f1750f 100644
--- a/website/docs/designDocs/submarine-server/architecture.md
+++ b/website/docs/designDocs/submarine-server/architecture.md
@@ -47,10 +47,10 @@ title: Submarine Server Implementation
Here's a diagram to illustrate the Submarine's deployment.
- Submarine Server consists of web service/proxy, and backend services.
They're like "control planes" of Submarine, and users will interact with these
services.
-- Submarine server could be a microservice architecture and can be deployed to
one of the compute clusters. (see below, this will be useful when we only have
one cluster).
+- Submarine server could be a microservice architecture and can be deployed to
one of the compute clusters. (see below, this will be useful when we only have
one cluster).
- There're multiple compute clusters that could be used by Submarine service.
For user's running notebook instance, jobs, etc. they will be placed to one of
the compute clusters by user's preference or defined policies.
- Submarine's asset includes
project/notebook(content)/models/metrics/dataset-meta, etc. can be stored
inside Submarine's own database.
-- Datasets can be stored in various locations such as S3/HDFS.
+- Datasets can be stored in various locations such as S3/HDFS.
- Users can push container (such as Docker) images to a preconfigured registry
in Submarine, so Submarine service can know how to pull required container
images.
- Image Registry/Data-Storage, etc. are outside of Submarine server's scope
and should be managed by 3rd party applications.
@@ -74,7 +74,7 @@ Submarine Server exposed UI and REST API. Users can also use
CLI / SDK to manage
+----------+
```
-REST API will be used by the other 3 approaches. (CLI/SDK/UI)
+REST API will be used by the other 3 approaches. (CLI/SDK/UI)
The REST API Service handles HTTP requests and is responsible for
authentication. It acts as the caller for the JobManager component.
@@ -82,25 +82,25 @@ The REST component defines the generic job spec which
describes the detailed inf
## Proposal
```
-
+---------------------+
- +-----------+ | +--------+
+----+ |
- | | |
|runtime1+-->+job1| |
- | workbench +---+ +----------------------------------+ | +--------+
+----+ |
- | | | | +------+ +---------------------+ | +-->+ +--------+
+----+ |
- +-----------+ | | | | | +------+ +-------+ | | | |
|runtime2+-->+job2| |
- | | | | | | YARN | | K8s | | | | | +--------+
+----+ |
- +-----------+ | | | | | +------+ +-------+ | | | | YARN
Cluster |
- | | | | | | | submitter | | |
+---------------------+
- | CLI +------>+ | REST | +---------------------+ +---+
- | | | | | | +---------------------+ | |
+---------------------+
- +-----------+ | | | | | +-------+ +-------+ | | | | +--------+
+----+ |
- | | | | | |PlugMgr| |monitor| | | | | |
+-->+job1| |
- +-----------+ | | | | | +-------+ +-------+ | | | | | |
+----+ |
- | | | | | | | JobManager | | +-->+ |operator|
+----+ |
- | SDK +---+ | +------+ +---------------------+ | | |
+-->+job2| |
- | | +----------------------------------+ | +--------+
+----+ |
- +-----------+ | K8s
Cluster |
- client server
+---------------------+
+
+ +-----------+
+ | |
+ | workbench +---+ +----------------------------------+
+ | | | | +------+ +---------------------+ |
+ +-----------+ | | | | | +-------+ | |
+---------------------+
+ | | | | | | K8s | | | | +--------+
+----+ |
+ +-----------+ | | | | | +-------+ | | | |
+-->+job1| |
+ | | | | | | | submitter | | | | |
+----+ |
+ | CLI +------>+ | REST | +---------------------+ +---->+ |operator|
+----+ |
+ | | | | | | +---------------------+ | | |
+-->+job2| |
+ +-----------+ | | | | | +-------+ +-------+ | | | +--------+
+----+ |
+ | | | | | |PlugMgr| |monitor| | | | K8s
Cluster |
+ +-----------+ | | | | | +-------+ +-------+ | |
+---------------------+
+ | | | | | | | JobManager | |
+ | SDK +---+ | +------+ +---------------------+ |
+ | | +----------------------------------+
+ +-----------+
+ client server
```
We propose to split the original core module in the old layout into two
modules, CLI and server as shown in FIG. The submarine-client calls the REST
APIs to submit and retrieve the job info. The submarine-server provides the
REST service, job management, submitting the job to cluster, and running job in
different clusters through the corresponding runtime.
@@ -126,11 +126,11 @@ We propose to split the original core module in the old
layout into two modules,
+----------------------------------------------------------------------+
```
-### Experiment Manager
+### Experiment Manager
TODO
-### Notebook Sessions Manager
+### Notebook Sessions Manager
TODO
@@ -142,7 +142,7 @@ TODO
TODO
-### Model Serving Manager
+### Model Serving Manager
TODO
@@ -150,11 +150,11 @@ TODO
TODO
-### Dataset Manager
+### Dataset Manager
TODO
-### User/team permissions manager
+### User/team permissions manager
TODO
@@ -164,4 +164,4 @@ TODO
## Components/services outside of Submarine Server's scope
-TODO: Describe what are the out-of-scope components, which should be handled
and managed outside of Submarine server. Candidates are: Identity management,
data storage, metastore storage, etc.
\ No newline at end of file
+TODO: Describe what are the out-of-scope components, which should be handled
and managed outside of Submarine server. Candidates are: Identity management,
data storage, metastore storage, etc.
diff --git a/website/docs/designDocs/submarine-server/experimentSpec.md
b/website/docs/designDocs/submarine-server/experimentSpec.md
index fc2abfb..f705a87 100644
--- a/website/docs/designDocs/submarine-server/experimentSpec.md
+++ b/website/docs/designDocs/submarine-server/experimentSpec.md
@@ -1,5 +1,5 @@
---
-title: Generic Expeiment Spec
+title: Generic Experiment Spec
---
<!--
@@ -61,13 +61,13 @@ The library spec describes the info about machine learning
framework. All the fi
| envVars | key/value | YES | The public env vars for the task if not
specified. |
### Submitter Spec
-It describes the info of submitter which the user specified, such as yarn,
yarnservice or k8s. All the fields as below:
+It describes the info of submitter which the user specified, such as k8s. All
the fields as below:
| field | type | optional | description |
|---|---|---|---|
| type | string | NO | The submitter type, supports `k8s` now |
| configPath | string | YES | The config path of the specified resource
manager. You can set it in submarine-site.xml if run submarine-server locally |
-| namespace | string | NO | It's known as queue in Apache Hadoop YARN and
namespace in Kubernetes. |
+| namespace | string | NO | It's known as namespace in Kubernetes. |
| kind | string | YES | It's used for k8s submitter, supports TFJob and
PyTorchJob |
| apiVersion | string | YES | It should pair with the kind, such as the
TFJob's api version is `kubeflow.org/v1` |
diff --git a/website/docs/designDocs/wip-designs/submarine-launcher.md
b/website/docs/designDocs/wip-designs/submarine-launcher.md
index 2cc0ee9..6f05d33 100644
--- a/website/docs/designDocs/wip-designs/submarine-launcher.md
+++ b/website/docs/designDocs/wip-designs/submarine-launcher.md
@@ -17,45 +17,45 @@ title: Submarine Launcher
-->
:::warning
-Please note that this design doc is working-in-progress and need more works to
complete.
+Please note that this design doc is working-in-progress and need more works to
complete.
:::
## Introduction
Submarine is built and run in Cloud Native, taking advantage of the cloud
computing model.
-To give full play to the advantages of cloud computing.
-These applications are characterized by rapid and frequent build, release, and
deployment.
-Combined with the features of cloud computing, they are decoupled from the
underlying hardware and operating system,
+To give full play to the advantages of cloud computing.
+These applications are characterized by rapid and frequent build, release, and
deployment.
+Combined with the features of cloud computing, they are decoupled from the
underlying hardware and operating system,
and can easily meet the requirements of scalability, availability, and
portability. And provide better economy.
-In the enterprise data center, submarine can support k8s/yarn/docker three
resource scheduling systems;
+In the enterprise data center, submarine can support k8s/docker three resource
scheduling systems;
in the public cloud environment, submarine can support these cloud services in
GCE/AWS/Azure;
## Requirement
### Cloud-Native Service
-The submarine server is a long-running services in the daemon mode.
-The submarine server is mainly used by algorithm engineers to provide online
front-end functions such as algorithm development,
-algorithm debugging, data processing, and workflow scheduling.
+The submarine server is a long-running services in the daemon mode.
+The submarine server is mainly used by algorithm engineers to provide online
front-end functions such as algorithm development,
+algorithm debugging, data processing, and workflow scheduling.
And submarine server also mainly used for back-end functions such as
scheduling and execution of jobs, tracking of job status, and so on.
-Through the ability of rolling upgrades, we can better provide system
stability.
+Through the ability of rolling upgrades, we can better provide system
stability.
For example, we can upgrade or restart the workbench server without affecting
the normal operation of submitted jobs.
You can also make full use of system resources.
For example, when the number of current developers or job tasks increases,
The number of submarine server instances can be adjusted dynamically.
-In addition, submarine will provide each user with a completely independent
workspace container.
-This workspace container has already deployed the development tools and
library files commonly used by algorithm engineers including their operating
environment.
+In addition, submarine will provide each user with a completely independent
workspace container.
+This workspace container has already deployed the development tools and
library files commonly used by algorithm engineers including their operating
environment.
Algorithm engineers can work in our prepared workspaces without any extra work.
Each user's workspace can also be run through a cloud service.
### Service discovery
-With the cluster function of submarine, each service only needs to run in the
container,
-and it will automatically register the service in the submarine cluster
center.
+With the cluster function of submarine, each service only needs to run in the
container,
+and it will automatically register the service in the submarine cluster center.
Submarine cluster management will automatically maintain the relationship
between service and service, service and user.
## Design
@@ -65,16 +65,16 @@ Submarine cluster management will automatically maintain
the relationship betwee
### Launcher
-The submarine launcher module defines the complete interface.
-By using this interface, you can run the submarine server, and workspace in
k8s / yarn / docker / AWS / GCE / Azure.
+The submarine launcher module defines the complete interface.
+By using this interface, you can run the submarine server, and workspace in
k8s / docker / AWS / GCE / Azure.
### Launcher On Docker
-In order to allow some small and medium-sized users without k8s/yarn to use
submarine,
+In order to allow some small and medium-sized users without k8s to use
submarine,
we support running the submarine system in docker mode.
-Users only need to provide several servers with docker runtime environment.
-The submarine system can automatically cluster these servers into clusters,
manage all the hardware resources of the cluster,
+Users only need to provide several servers with docker runtime environment.
+The submarine system can automatically cluster these servers into clusters,
manage all the hardware resources of the cluster,
and run the service or workspace container in this cluster through scheduling
algorithms.
@@ -82,9 +82,6 @@ and run the service or workspace container in this cluster
through scheduling al
submarine operator
-### Launcher On Yarn
-[TODO]
-
### Launcher On AWS
[TODO]
diff --git a/website/docs/devDocs/README.md b/website/docs/devDocs/README.md
index d25ab35..407fd8e 100644
--- a/website/docs/devDocs/README.md
+++ b/website/docs/devDocs/README.md
@@ -25,7 +25,7 @@ This document mainly describes the structure of each module
of the Submarine pro
### 2.1. submarine-client
-Provide the CLI interface for submarine user. (Currently only support YARN
service)
+Provide the CLI interface for submarine user. (Currently only support YARN
service (deprecated))
### 2.2. submarine-cloud-v2
@@ -45,7 +45,7 @@ Provide Python SDK for submarine user.
### 2.6. submarine-server
-Include core server, restful api, and k8s/yarn submitter.
+Include core server, restful api, and k8s submitter.
### 2.7. submarine-test