This is an automated email from the ASF dual-hosted git repository.
wanggenhua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/dolphinscheduler-website.git
The following commit(s) were added to refs/heads/master by this push:
new b021424931 ADD Blog (#871)
b021424931 is described below
commit b0214249314e125cb2a1ccca934403def149fe6c
Author: lifeng <[email protected]>
AuthorDate: Mon Dec 26 17:22:45 2022 +0800
ADD Blog (#871)
* ADD Blog
* updata
* updata
---
blog/en-us/Apache_dolphinScheduler_3.1.2.md | 68 ++++++
...inTech_data_center_based_on_DolphinScheduler.md | 176 +++++++++++++++
...e_DolphinScheduler_Machine_Learning_Workflow.md | 236 +++++++++++++++++++++
blog/img/media/16720397220045/16720397367629.jpg | Bin 0 -> 59519 bytes
blog/img/media/16720400637574/16720400704016.jpg | Bin 0 -> 83706 bytes
blog/img/media/16720400637574/16720400759248.jpg | Bin 0 -> 58667 bytes
blog/img/media/16720400637574/16720401185208.jpg | Bin 0 -> 46943 bytes
blog/img/media/16720400637574/16720401253440.jpg | Bin 0 -> 126403 bytes
blog/img/media/16720400637574/16720401472681.jpg | Bin 0 -> 67800 bytes
blog/img/media/16720400637574/16720402083983.jpg | Bin 0 -> 33154 bytes
blog/img/media/16720400637574/16720402291980.jpg | Bin 0 -> 63322 bytes
blog/img/media/16720400637574/16720402508893.jpg | Bin 0 -> 65984 bytes
blog/img/media/16720400637574/16720402711565.jpg | Bin 0 -> 27768 bytes
blog/img/media/16720400637574/16720402758234.jpg | Bin 0 -> 50365 bytes
blog/img/media/16720400637574/16720403297820.jpg | Bin 0 -> 61933 bytes
blog/img/media/16720400637574/16720403572773.jpg | Bin 0 -> 25883 bytes
blog/img/media/16720400637574/16720403977529.jpg | Bin 0 -> 47344 bytes
blog/img/media/16720400637574/16720404142720.jpg | Bin 0 -> 34341 bytes
blog/img/media/16720400637574/16720404412259.jpg | Bin 0 -> 33141 bytes
blog/img/media/16720405454837/16720405586499.jpg | Bin 0 -> 155526 bytes
blog/img/media/16720405454837/16720407528096.jpg | Bin 0 -> 53369 bytes
blog/img/media/16720405454837/16720407653742.jpg | Bin 0 -> 33004 bytes
blog/img/media/16720405454837/16720408372893.jpg | Bin 0 -> 32516 bytes
blog/img/media/16720405454837/16720408471707.jpg | Bin 0 -> 40498 bytes
blog/img/media/16720405454837/16720408537181.jpg | Bin 0 -> 25097 bytes
blog/img/media/16720405454837/16720408664980.jpg | Bin 0 -> 31613 bytes
blog/img/media/16720405454837/16720408742949.jpg | Bin 0 -> 75740 bytes
blog/img/media/16720405454837/16720408868765.jpg | Bin 0 -> 36381 bytes
blog/img/media/16720405454837/16720408963992.jpg | Bin 0 -> 60099 bytes
blog/img/media/16720405454837/16720409057879.jpg | Bin 0 -> 42155 bytes
blog/img/media/16720405454837/16720409115839.jpg | Bin 0 -> 30300 bytes
blog/img/media/16720405454837/16720409204499.jpg | Bin 0 -> 14531 bytes
blog/img/media/16720405454837/16720409274430.jpg | Bin 0 -> 25945 bytes
config/blog/en-us/release.json | 6 +
config/blog/en-us/tech.json | 7 +
config/blog/en-us/user.json | 9 +-
36 files changed, 501 insertions(+), 1 deletion(-)
diff --git a/blog/en-us/Apache_dolphinScheduler_3.1.2.md
b/blog/en-us/Apache_dolphinScheduler_3.1.2.md
new file mode 100644
index 0000000000..19a69f25ea
--- /dev/null
+++ b/blog/en-us/Apache_dolphinScheduler_3.1.2.md
@@ -0,0 +1,68 @@
+---
+title: Apache DolphinScheduler releases version 3.1.2 with Python API
optimizations
+keywords: Apache,DolphinScheduler,scheduler,big
data,ETL,airflow,hadoop,orchestration,dataops,Kubernetes
+description: Recently, Apache DolphinScheduler released version 3.1.2.
+---
+# Apache DolphinScheduler releases version 3.1.2 with Python API optimizations
+
+Recently, Apache DolphinScheduler released version 3.1.2. This version is mainly based on version 3.1.1, with 6 Python API optimizations, 19 bug fixes, and 4 document updates.
+
+## Important bug fixes:
+
+* Worker kill process does not take effect (#12995)
+* Complement dependency mode generates wrong workflow instance (#13009)
+* Python task parameter passing error (#12961)
+* Fix dependency task null pointer (#12965)
+* Task retry error (#12903)
+* Shell task calls dolphinscheduler_env.sh configuration file exception
(#12909)
+* Corrected documentation for multiple Hive SQL runs (#12765)
+* Added token authentication for Python API (#12893)
+
+## Change Log
+
+### Bug fix
+* [Improvement] change alert start.sh (#13100)
+* [Fix] Add token as authentication for python gateway (#12893)
+* [Fix-13010] [Task] The Flink SQL task page selects the pre-job deployment
mode, but the task executed by the worker is the Flink local mode
+* [Fix-12997][API] Fix that the end time is not reset when the workflow
instance reruns. (#12998)
+* [Fix-12994] [Worker] Fix kill process does not take effect (#12995)
+* Fix sql task will send alert if we don’t choose the send email #12984
+* [Fix-13008] [UI] When using the complement function, turn on the dependent
mode to generate multiple unrelated workflow instances (#13009)
+* [Fix][doc] python api release link
+* [Fix] Python task can not pass the parameters to downstream task. (#12961)
+* [Fix] Fix Java path in Kubernetes Helm Chart (#12987)
+* [Fix-12963] [Master] Fix dependent task node null pointer exception (#12965)
+* [Fix-12954] [Schedule] Fix that workflow-level configuration information
does not take effect when timing triggers execution
+* Fix execute shell task exception no dolphinscheduler_env.sh file execute
permission (#12909)
+* Upgrade clickhouse jdbc driver #12639
+* add spring-context to alert api (#12892)
+* [Upgrade][SQL]Modify the table t_ds_worker_group to add a description field
in the postgresql upgrade script #12883
+* Fix NPE while retry task (#12903)
+* [Fix-12832][API] Fix update worker group exception group name already exists. #12874
+* Fix and enhance helm db config (#12707)
+
+### Document
+* [Fix][Doc] Fix sql-hive and hive-cli doc (#12765)
+* [Fix][Alert] Ignore alert not write info to db (#12867)
+* [Doc] Add skip spotless check during ASF release #12835
+* [Doc][Bug] Fix dead link caused by markdown cross-files anchor #12357
(#12877)
+
+### Python API
+* [Fix] python API upload resource center failed
+* [Feature] Add CURD to the project/tenant/user section of the python-DS
(#11162)
+* [Chore][Python] Change name from process definition to workflow (#12918)
+* [Feature] Support set execute type to pydolphinscheduler (#12871)
+* [Hotfix] Correct python doc link
+* [Improvement][Python] Validate version of Python API at launch (#11626)
+
+## Acknowledgment
+
+Thanks to all community contributors who participated in the release of Apache
DolphinScheduler 3.1.2. Below is the list of the contributors by GitHub ID, in
no particular order.
+
+
+
+| liqingwang | liqingwang | hezean |
+|--------------|-------------|--------------|
+| ruanwenjun | simsicon | jieguangzhou |
+| Tianqi-Dotes | zhuangchong | zhongjiajie |
+
diff --git
a/blog/en-us/Application_transformation_of_the_FinTech_data_center_based_on_DolphinScheduler.md
b/blog/en-us/Application_transformation_of_the_FinTech_data_center_based_on_DolphinScheduler.md
new file mode 100644
index 0000000000..0d6e2b5703
--- /dev/null
+++
b/blog/en-us/Application_transformation_of_the_FinTech_data_center_based_on_DolphinScheduler.md
@@ -0,0 +1,176 @@
+---
+title: Application transformation of the FinTech data center based on DolphinScheduler
+keywords: Apache,DolphinScheduler,scheduler,big
data,ETL,airflow,hadoop,orchestration,dataops,Kubernetes
+description: On Apache DolphinScheduler Meetup last week, Feng Mingxia,
+---
+# Application transformation of the FinTech data center based on
DolphinScheduler
+
+At the Apache DolphinScheduler Meetup last week, Feng Mingxia, a big data engineer from Chengfang FinTech, presented the application of DolphinScheduler in the field of FinTech. The following is the presentation.
+
+
+Feng Mingxia, Chengfang Financial Technology Big Data Engineer
+
+He focuses on real-time and offline data processing and analysis in the field of big data, and is currently mainly responsible for the research and development of data middle platforms.
+
+Speech summary:
+
+* Use background
+* Secondary transformation based on DolphinScheduler
+* DolphinScheduler plug-in expansion
+* Future and outlook
+
+## Use Background
+
+### Data Center Construction
+
+At present, big data technology is widely used in the financial field, and the big data platform has become a financial infrastructure. In the construction of a big data platform, the data center (also called the data middle office or data middle platform in this article) is the brightest star: it is the entrance and interface through which business systems use big data. When various business systems are connected to the data center, it needs to provide unified management and unified access to ensure the security, reliability, efficiency, and reli [...]
+
+As shown in the figure below, the data middle office is in the middle link
between the business systems and the big data platform, each business system
accesses the big data platform through the services provided by the data center.
+
+
+The core concept of the data middle office is to realize four transformations: turning business into data, data into assets, assets into services, and services into business. This forms a complete closed loop from business to data and back to business, supporting the digital transformation of enterprises.
+
+
+The logical architecture of the data center is shown in the figure above. Analyzing from bottom to top, the bottom layer is the data resource layer, which holds the original data generated by various business systems. The next layer is data integration; its methods include offline collection and real-time collection, using technologies such as Flume and CDC real-time collection.
+
+The next layer is the data lake. Data enters the lake through data integration and is stored in Hadoop distributed storage or an MPP architecture database.
+
+The next layer is the data engine layer, which processes and analyzes the data in the data lake through real-time and offline computing engines like Flink and Spark, producing service data for the upper layer.
+
+The next layer is the data service that the data center needs to provide. At
present, the data service includes data development service and data sharing
service, providing data development and sharing capabilities for the upper
business systems.
+
+The data application layer is the specific application of data, including data
anomaly detection, data governance, AI decision-making, and BI analysis.
+
+In the construction of the whole data middle platform, the scheduling engine occupies a core position in the data engine layer and is also an important function of the data middle platform.
+
+### Problems and challenges faced by the data center
+The data middle office will face some problems and challenges.
+
+First of all, the execution and scheduling of data tasks are the core and key
of data development services provided by the data center.
+
+Secondly, the data center provides unified data service management, service
development, service invocation, and service monitoring.
+
+Third, ensuring the security of financial data is the primary task of FinTech,
and the data middle office needs to ensure the security and reliability of data
services.
+
+Under the above problems and challenges, we investigated some open-source
scheduling engines.
+
+
+
+At present, we use a variety of scheduling engines in production, such as Oozie, XXL-JOB, and DolphinScheduler. We introduced DolphinScheduler in 2022 after research and analysis, and it plays a very important role in the construction of the entire data center.
+
+First of all, DolphinScheduler partially addresses our requirements for unified service management, service development, service invocation, and service monitoring.
+
+Secondly, it has its own unique design in task fault tolerance, supporting HA,
elastic expansion, fault tolerance, and basically ensuring the safe operation
of tasks.
+
+Third, it supports task and node monitoring.
+
+Fourth, it supports multi-tenant and permission control.
+
+Finally, its community is very active, with rapid version change and problem
repair.
+
+Through the analysis of DolphinScheduler’s architecture and source code, we believe that its architecture conforms to mainstream big data framework design and shares architecture patterns and designs with excellent products such as HBase and Kafka.
+
+### Re-development based on DolphinScheduler
+
+To make DolphinScheduler more suitable for our application scenarios, we re-developed it in six areas:
+
+* Add asynchronous service call function
+* Add Oracle adaptation for the metadata database
+* Add multi-environment configuration capability
+* Add log and historical data-cleaning strategy
+* Add access to Yarn logs
+* Add service security strategy
+
+### Add asynchronous service calling function
+
+First, we added an asynchronous service invocation function. The figure above shows the architecture of DolphinScheduler version 2.0.5; most of the components are service components of the native DolphinScheduler. The GateWay marked in red is a gateway service that we added on top of DolphinScheduler. It implements flow control and blacklists and whitelists, and is also the entry point for users to access service development. By optimizing the startup interface of the process and returning the unique code of the process [...]
+
+
+In the classic DolphinScheduler access mode, the workflow execution instructions submitted by users enter the command table in the metadata database. After acquiring the ZooKeeper (zk) lock, the master component obtains commands from the metadata database, performs DAG parsing, generates the actual process instances, delivers the decomposed tasks to the worker nodes for execution through RPC, and then synchronously waits for the execution results.
+
+In a native DolphinScheduler request, no return code for the workflow execution is given after the user submits the instruction. Therefore, we added a unique return ID, through which users can query the subsequent process status, download logs, and download data.
+
+### Add Oracle adaptation for the metadata database
+Our second transformation is to adapt DolphinScheduler to the Oracle database. At present, the metadata database of the native DolphinScheduler is MySQL, and we need to switch it to an Oracle database to meet our production needs. To achieve this, it is necessary to complete the adaptation of the data initialization module and the data operation module.
+
+
+
+First, for the data initialization module, we modified the install_config.conf configuration file to use the Oracle configuration.
+
+Secondly, an Oracle-specific application.yml needs to be added; we added it to the apache-dolphinscheduler-2.0.*-bin/conf/ directory.
+
+Finally, we converted the data operation module by modifying the mapper files. The dolphinscheduler-dao module is the database operation module, and other modules reference it to implement database operations. It uses MyBatis for database access, so the mapper files need to be changed; all mapper files are in the resources directory.
+
+### Multi-environment configuration capability
+The native DolphinScheduler installation cannot be configured per environment; relevant parameters generally need to be adjusted by hand according to the actual environment. We wanted the installation script to support environment selection and configuration, to reduce the cost of manual online modification and automate installation. It is believed that all partners have encountered similar difficulties. In order to use DolphinScheduler in a development envir [...]
+
+We modified the install.sh file to accept an input parameter [dev|test|product] and select the matching install_config_${evn}.conf, so that the installation automatically picks the right environment.
+
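A minimal sketch of the environment selection described above, assuming the per-environment files follow the install_config_<env>.conf naming from the text. The function name select_install_config and the conf/config default directory are hypothetical and are not taken from the authors' install.sh.

```shell
#!/usr/bin/env bash
# Sketch: map an environment name to its install config file.
# select_install_config and the default directory are illustrative assumptions.
select_install_config() {
    local env="$1"
    local conf_dir="${2:-conf/config}"
    case "$env" in
        dev|test|product)
            # Print the config file that the installer should load
            echo "${conf_dir}/install_config_${env}.conf"
            ;;
        *)
            # Reject anything outside the three supported environments
            echo "usage: install.sh [dev|test|product]" >&2
            return 1
            ;;
    esac
}
```

An install script could call this with its first argument and stop when the function returns a non-zero status, so that a mistyped environment name never falls back to a default configuration.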
+In addition, DolphinScheduler’s workflow is strongly bound to the environment, and workflows in different environments cannot be shared. The following figure shows the JSON file of a workflow exported by the native DolphinScheduler. The grayed part represents the resources on which the process depends. The ID is a number generated by database auto-increment. However, if the process instances generated by environment A are placed in environment B, there may b [...]
+
+
+We solve this problem by generating the absolute path of the resource as the
unique ID of the resource.
+
+### Log and historical data cleaning policy
+
+DolphinScheduler generates a lot of data. The database accumulates instance data in the instance tables, which keep growing as instance tasks run. Our strategy is to define a DolphinScheduler scheduled task that cleans up these tables according to the agreed retention period.
+
+Secondly, DolphinScheduler's file data mainly consists of log data and task execution directories, including the service logs of the worker, master, and API, and the directories in which the worker executes tasks. This data is not deleted automatically when task execution ends, so it also needs to be deleted through scheduled tasks. By running the log cleanup script, we can delete logs automatically.
+
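The log cleanup could be sketched as below. This is an illustration, not the authors' script: the function name clean_old_logs, the directory argument, and the retention period are assumptions, and it relies on the -delete option available in GNU and BSD find.

```shell
#!/usr/bin/env bash
# Sketch: delete files older than a retention period under a log directory.
# clean_old_logs is a hypothetical helper, not part of DolphinScheduler.
clean_old_logs() {
    local log_dir="$1"
    local keep_days="$2"
    # Refuse to run if the directory does not exist
    [ -d "$log_dir" ] || return 1
    # -mtime +N matches files whose age in whole days is greater than N;
    # -print lists each file that is removed
    find "$log_dir" -type f -mtime +"$keep_days" -print -delete
}
```

A cron entry or a scheduled DolphinScheduler shell task could then call, for example, clean_old_logs /path/to/worker/logs 30, where the path and the 30-day retention stand in for whatever save cycle has been agreed.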
+
+
+
+
+### Increased access to Yarn logs
+
+The native DolphinScheduler can obtain the logs of tasks executed on the worker node, but for tasks on Yarn, you need to log in to the Yarn cluster and obtain the logs through a command or interface. We obtain the Yarn task ID by analyzing the YARNID tag in the log and then fetch the task log through the yarnclient. This removes the need to view logs manually.
+
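As an illustration of the log analysis step, the sketch below extracts YARN application IDs from a task log. It assumes the log contains IDs in YARN's standard application_<clusterTimestamp>_<sequence> form; the exact YARNID tag format used by the authors is not shown in the article, and extract_yarn_app_ids is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: pull YARN application IDs out of a task log file.
# extract_yarn_app_ids is an illustrative name, not a DolphinScheduler function.
extract_yarn_app_ids() {
    local log_file="$1"
    # Match application_<clusterTimestamp>_<sequence> and drop duplicates
    grep -oE 'application_[0-9]+_[0-9]+' "$log_file" | sort -u
}
```

Each ID can then be passed to the yarn CLI, for example yarn logs -applicationId <id>, as a command-line counterpart to the yarnclient call described above.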
+
+
+
+### Service security policy
+
+* Add Monitor component monitoring
+
+
+
+
+The above figure shows the interaction between the master and worker, the two
core components of DolphinScheduler, and Zookeeper. When the MasterServer
service starts, it will register a temporary node with Zookeeper, and conduct
fault tolerance processing by listening for changes in Zookeeper temporary
nodes. WorkerServer is mainly responsible for task execution. When the
WorkerServer service starts, it registers a temporary node with Zookeeper and
maintains the heartbeat. At present, Z [...]
+
+The relevant parameters can be seen when the master and worker connect to Zookeeper, including the connection timeout, the session timeout, and the maximum number of retries.
+
+Due to network jitter and other factors, master and worker nodes may lose their connection to zk. Once the connection is lost, the temporary nodes registered on zk by the worker and master disappear, so the master and worker are considered lost, which affects task execution. Without human intervention, the task will be delayed. We added the monitor component to monitor the service status. Through the scheduled task cron, we run the monitor program [...]
+
+* Add Kerberos authentication link for service components using zk
+
+The second security policy is to add the Kerberos authentication link for
service components using zk. Kerberos is a network authentication protocol
designed to provide powerful authentication services for client/server
applications through a key system. Master service components, API service
components, and worker service components complete Kerberos authentication at
startup, and then use zk for relevant service registration and heartbeat
connection to ensure service security.
+
+### DolphinScheduler-based plugin extension
+In addition, we have extended the plug-in based on DolphinScheduler. We have
extended four types of operators, including Richshell, SparkSQL, Dataexport,
and GBase operators.
+
+### Add a new task type Richshell
+First of all, Richshell, a new task type, has enhanced the native Shell
function. It mainly realizes the dynamic replacement of script parameters
through the template engine. Users can replace script parameters through
service calls, making users more flexible in using parameters. It is a
supplement to global parameters.
+
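The idea of parameter replacement can be illustrated with a small stand-in. The article does not name the template engine or the placeholder syntax that Richshell uses, so this sketch assumes ${key} placeholders and uses sed in place of a real template engine; render_param is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: replace every ${key} placeholder in a script template with a value.
# Assumes the key is a plain identifier (letters, digits, underscores).
render_param() {
    local template="$1"
    local key="$2"
    local value="$3"
    local escaped
    # Escape characters that are special in the replacement part of sed's s command
    escaped=$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')
    # \$ makes sed treat the dollar sign as a literal character
    printf '%s\n' "$template" | sed -e "s|\\\${${key}}|${escaped}|g"
}
```

For example, render_param 'hdfs dfs -ls /data/${dt}' dt 2022-12-26 prints hdfs dfs -ls /data/2022-12-26. A service call could supply the value at run time, which is the flexibility the Richshell task type is described as adding on top of global parameters.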
+
+
+
+### Add a new task type SparkSQL
+
+The second operator added is SparkSQL. Users can execute Spark tasks by writing SQL so that tasks can be scheduled on Yarn. DolphinScheduler natively also supports SparkSQL execution in JDBC mode, but there is resource contention because the number of JDBC connections is limited. Tools such as spark-sql and Spark Beeline cannot run in Yarn cluster mode. By using this task type, SparkSQL programs can be run on the Yarn cluster in cluster mode to maximize [...]
+
+### Add a new task type Dataexport
+
+The third addition is Dataexport, a data export operator. Users can export data stored in different storage components by selecting the component. Components include ES, Hive, HBase, etc.
+
+
+The data in the big data platform may be used for BI display, statistical
analysis, machine learning, and other data preparation after being exported.
Most of these scenarios require data export, and Spark’s data processing
capability is used to achieve the export function of different data sources.
+
+### Add a new task type GBase
+The fourth plug-in added is GBase. GBase 8a MPP Cluster is a distributed parallel database cluster with column storage and a shared-nothing architecture. It offers high performance, high availability, and high scalability. It is suitable for OLAP scenarios (query scenarios), can provide a cost-effective general computing platform for large-scale data management, and is widely used to support various data warehouse systems, BI systems, and decision support systems.
+
+
+As an application scenario of data entering the lake, we have added a GBase
operator, which supports the import, export, and execution of GBase data.
+
diff --git
a/blog/en-us/Quick_Start_with_Apache_DolphinScheduler_Machine_Learning_Workflow.md
b/blog/en-us/Quick_Start_with_Apache_DolphinScheduler_Machine_Learning_Workflow.md
new file mode 100644
index 0000000000..bc78b98f16
--- /dev/null
+++
b/blog/en-us/Quick_Start_with_Apache_DolphinScheduler_Machine_Learning_Workflow.md
@@ -0,0 +1,236 @@
+---
+title: Quick Start with Apache DolphinScheduler Machine Learning Workflow
+keywords: Apache,DolphinScheduler,scheduler,big
data,ETL,airflow,hadoop,orchestration,dataops,Kubernetes,Conda
+description: With the release of Apache DolphinScheduler 3.1.0, many AI
components
+---
+# Quick Start with Apache DolphinScheduler Machine Learning Workflow
+
+## Abstract
+With the release of Apache DolphinScheduler 3.1.0, many AI components have
been added to help users to build machine learning workflows on Apache
DolphinScheduler more efficiently.
+
+This article describes in detail how to set up DolphinScheduler with some
Machine Learning environments. It also introduces the use of the MLflow
component and the DVC component with experimental examples.
+
+## DolphinScheduler and Machine Learning Environment
+**Test Program**
+
+All code can be found at https://github.com/jieguangzhou/dolphinscheduler-ml-tutorial
+
+Get the code
+
+```
+git clone https://github.com/jieguangzhou/dolphinscheduler-ml-tutorial.git
+cd dolphinscheduler-ml-tutorial
+git checkout dev
+```
+### Installation environment
+**Conda**
+Install Conda following the official website and add the Conda path to the environment variables.
+
+Then install mlflow and dvc. After installation, the mlflow and dvc commands will be available in Conda’s bin directory.
+```
+pip install mlflow==1.30.0 dvc
+```
+
+**Java8 environment**
+
+```
+sudo apt-get update
+sudo apt-get install openjdk-8-jdk
+java -version
+```
+Configure the Java environment variables in ~/.bashrc or ~/.zshrc
+
+```
+# Confirm that your jdk is as below and configure the environment variables
+export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
+export PATH=$PATH:$JAVA_HOME/bin
+```
+
+**Apache DolphinScheduler 3.1.0**
+
+Download DolphinScheduler 3.1.0
+```
+# Go to the following directory (you can install in other directories; for ease of replication, this example installs in the following directory)
+cd first-example/install_dolphinscheduler
+## install DolphinScheduler
+wget https://dlcdn.apache.org/dolphinscheduler/3.1.0/apache-dolphinscheduler-3.1.0-bin.tar.gz
+tar -zxvf apache-dolphinscheduler-3.1.0-bin.tar.gz
+rm apache-dolphinscheduler-3.1.0-bin.tar.gz
+```
+
+Configuring the Conda environment and Python environment in DolphinScheduler
+```
+## Configure conda environment and default python environment
+cp common.properties apache-dolphinscheduler-3.1.0-bin/standalone-server/conf
+# Add the directory that contains the conda executable to PATH; \$PATH is written literally so it expands when the env script is sourced
+echo "export PATH=$(dirname $(which conda)):\$PATH" >> apache-dolphinscheduler-3.1.0-bin/bin/env/dolphinscheduler_env.sh
+echo "export PYTHON_HOME=$(dirname $(which conda))/python" >> apache-dolphinscheduler-3.1.0-bin/bin/env/dolphinscheduler_env.sh
+```
+
+* dolphinscheduler-mlflow configuration
+When using the MLFLOW component, the dolphinscheduler-mlflow project on GitHub is referenced, so if you can’t get a proper network connection, you can replace the repository source by following these steps
+
+First, execute: git clone https://github.com/apache/dolphinscheduler-mlflow.git
+
+Then change the value of the ml.mlflow.preset_repository field in common.properties to the local path of the downloaded repository
+
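The property change can also be scripted. The sketch below assumes common.properties uses plain key=value lines and that the path contains no | or & characters; set_preset_repository is a hypothetical helper, and sed -i is used in its GNU form (BSD sed needs -i '').

```shell
#!/usr/bin/env bash
# Sketch: point ml.mlflow.preset_repository at a local clone.
# set_preset_repository is an illustrative name, not a DolphinScheduler tool.
set_preset_repository() {
    local props_file="$1"
    local repo_path="$2"
    if grep -q '^ml\.mlflow\.preset_repository=' "$props_file"; then
        # Rewrite the existing entry; '|' is the delimiter so '/' in paths is safe
        sed -i "s|^ml\.mlflow\.preset_repository=.*|ml.mlflow.preset_repository=${repo_path}|" "$props_file"
    else
        # The key is missing, so append it
        echo "ml.mlflow.preset_repository=${repo_path}" >> "$props_file"
    fi
}
```

It would be run against the copy of common.properties under standalone-server/conf before DolphinScheduler is started, so that the MLFLOW component picks up the local repository.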
+Start DolphinScheduler
+```
+## start DolphinScheduler
+cd apache-dolphinscheduler-3.1.0-bin
+bash bin/dolphinscheduler-daemon.sh start standalone-server
+## You can view the log using the following command
+# tail -500f standalone-server/logs/dolphinscheduler-standalone.log
+```
+
+Once started, wait a moment for the service to boot up, then open http://localhost:12345/dolphinscheduler/ui to see the DolphinScheduler page
+
+Account: admin, Password: dolphinscheduler123
+
+**MLflow**
+The MLflow Tracking Server is relatively simple to start up; it can be started with the command: docker run --name mlflow -p 5000:5000 -d jalonzjg/mlflow:latest
+
+Open http://localhost:5000, and you will be able to find the MLflow model and experiment management page
+
+
+The Dockerfile for this image can be found at first-example/docker-mlflow/Dockerfile
+
+**Components Introduction**
+There are 5 main types of components used in this article
+
+**SHELL component**
+The SHELL component is used to run shell-type tasks
+
+**PYTHON component**
+The PYTHON component is used to run python-type tasks
+
+**CONDITIONS component**
+CONDITIONS is a conditional node that determines which downstream task should
be run based on the running status of the upstream task.
+
+**MLFLOW component**
+MLFLOW component is used to run the MLflow Project on DolphinScheduler based
on the dolphinscheduler-mlflow library to implement pre-built algorithms and
AutoML functionality for classification scenarios and to deploy models on the
MLflow tracking server
+
+**DVC component**
+DVC component is used for data versioning in machine learning on
DolphinScheduler, such as registering specific data as a specific version and
downloading specific versions of data.
+
+Among the above five components:
+
+* The SHELL and PYTHON components are the base components, which can run a wide range of tasks.
+* CONDITIONS is a logical component that can dynamically control the logic of the workflow’s operation.
+* The MLFLOW and DVC components are machine learning components that make machine learning features easier to use within the workflow.
+
+**Machine learning workflow**
+The workflow consists of three parts.
+
+* The first part is the preliminary preparation, such as data download, data
versioning management repository, etc.; it is a one-time preparation.
+* The second part is the training model workflow: it includes data
pre-processing, training model, and model evaluation
+* The third part is the deployment workflow, which includes model deployment
and interface testing.
+
+**Preliminary preparation workflow**
+Create a directory to store all the process data: mkdir /tmp/ds-ml-example
+
+At the beginning of the program, we need to download the test data and initialize the DVC repository for data versioning
+
+All the following commands are run in the dolphinscheduler-ml-tutorial/first-example directory
+
+Since we are submitting the workflow via pydolphinscheduler, let’s install it: pip install apache-dolphinscheduler==3.1.0
+
+Workflow(download-data): Downloading test data
+
+Command: pydolphinscheduler yaml -f pyds/download_data.yaml
+
+Execute the following two tasks in order
+
+1. Install-dependencies: install the Python dependency packages needed by the download script
+
+2. Download-data: download the dataset to /tmp/ds-ml-example/raw
+
+
+Workflow(dvc_init_local): Initialize the dvc data versioning management
repository
+
+Command: pydolphinscheduler yaml -f pyds/init_dvc_repo.yaml
+
+Execute the following tasks in order
+
+1. create_git_repo: Create an empty git repository in the local environment
+
+2. init_dvc: convert the repository to a dvc-type repository for data
versioning
+
+3. condition: determine the status of the init_dvc task, if successful then
execute report_success_message, otherwise execute report_error_message
+
+
+**Training model workflow**
+In the training model part, the workflow includes data pre-processing, model training, and model evaluation.
+
+Workflow(download-data): data preprocessing
+
+Command: pydolphinscheduler yaml -f pyds/prepare_data.yaml
+
+
+Perform the following tasks in order
+
+1. data_preprocessing: preprocesses the data; for demo purposes, we only perform a simple truncation here
+
+2. upload_data: uploads data to the repository and registers it as a specific
version v1
+
+The following image shows the information in the git repository
+
+
+Workflow(train_model): Training model
+
+Command: pydolphinscheduler yaml -f pyds/train_model.yaml
+
+Perform the following tasks in order
+
+1. clean_exists_data: Delete the historical data in /tmp/ds-ml-example/train_data that repeated tests may have generated
+
+2. pull_data: pull v1 data to /tmp/ds-ml-example/train_data
+
+3. train_automl: Uses the MLFLOW component’s AutoML function to train the classification model and register it with the MLflow Tracking Server; if the current model version has the highest F1 score, it is registered as the Production version.
+
+4. inference: import a small part of the data for batch inference using the
mlflow CLI
+
+5. evaluate: Obtain the results of the inference and perform a simple evaluation of the model again, which includes the metrics on the new data, the predicted label distribution, etc.
+
+
+
+The results of the experiment and the model can be viewed in the MLflow Tracking Server ( http://localhost:5000 ) after train_automl has completed its operation.
+
+
+The operation logs for the evaluation task can be viewed after it has
completed its operation.
+
+
+**Deployment Process Workflow**
+Workflow(deploy_model): Deploy the model
+
+Run: pydolphinscheduler yaml -f pyds/deploy.yaml
+
+Run the following tasks in order.
+
+1. kill-server: Shut down the previous server
+
+2. deploy-model: Deploy the model
+
+3. test-server: Test the server
+
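The submission commands above can be collected in a small helper. This is a convenience sketch, not part of the tutorial repository: submit_workflows and the DRY_RUN variable are hypothetical, and it only submits the definitions with the pydolphinscheduler yaml -f command shown in this article; it does not wait for workflow runs to finish.

```shell
#!/usr/bin/env bash
# Sketch: submit several workflow definition files in the given order.
# submit_workflows and DRY_RUN are illustrative, not pydolphinscheduler features.
submit_workflows() {
    local yaml
    for yaml in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print the command instead of running it
            echo "pydolphinscheduler yaml -f $yaml"
        else
            # Stop at the first submission that fails
            pydolphinscheduler yaml -f "$yaml" || return 1
        fi
    done
}
```

Run from the first-example directory, the call would be submit_workflows pyds/download_data.yaml pyds/init_dvc_repo.yaml pyds/prepare_data.yaml pyds/train_model.yaml pyds/deploy.yaml, which mirrors the order in which the workflows are introduced.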
+
+If this workflow is started manually, the interface will look as follows; just enter the port number and the model version number.
+
+
+**Integrate the workflows**
+For practical use, once the workflows are stable, the whole process needs to be linked together: for example, after getting a new version of the data, train the model, and if it performs better, deploy the model.
+
+For example, we switch to the production version: git checkout first-example-production
+
+The differences between the two versions are:
+
+1. There is an additional workflow definition in train_and_deploy.yaml, which
is used to combine the various workflows
+
+2. The pre-processing script is modified to get the v2 data
+
+3. The flag in the definition of each sub-workflow is changed to false, so
that train_and_deploy.yaml runs them together.
+
+Run: pydolphinscheduler yaml -f pyds/train_and_deploy.yaml
+
+Each task in the diagram below is a sub-workflow task, which corresponds to
the three workflows described above.
+
+
+As shown below, the new version of the model, version2, is obtained after the
run and has been registered as the Production version.
+
+
+
diff --git a/blog/img/media/16720397220045/16720397367629.jpg
b/blog/img/media/16720397220045/16720397367629.jpg
new file mode 100644
index 0000000000..38390fff76
Binary files /dev/null and b/blog/img/media/16720397220045/16720397367629.jpg
differ
diff --git a/blog/img/media/16720400637574/16720400704016.jpg
b/blog/img/media/16720400637574/16720400704016.jpg
new file mode 100644
index 0000000000..c900e1f6f0
Binary files /dev/null and b/blog/img/media/16720400637574/16720400704016.jpg
differ
diff --git a/blog/img/media/16720400637574/16720400759248.jpg
b/blog/img/media/16720400637574/16720400759248.jpg
new file mode 100644
index 0000000000..2c6dd50560
Binary files /dev/null and b/blog/img/media/16720400637574/16720400759248.jpg
differ
diff --git a/blog/img/media/16720400637574/16720401185208.jpg
b/blog/img/media/16720400637574/16720401185208.jpg
new file mode 100644
index 0000000000..6f65489aa6
Binary files /dev/null and b/blog/img/media/16720400637574/16720401185208.jpg
differ
diff --git a/blog/img/media/16720400637574/16720401253440.jpg
b/blog/img/media/16720400637574/16720401253440.jpg
new file mode 100644
index 0000000000..4d9db4352b
Binary files /dev/null and b/blog/img/media/16720400637574/16720401253440.jpg
differ
diff --git a/blog/img/media/16720400637574/16720401472681.jpg
b/blog/img/media/16720400637574/16720401472681.jpg
new file mode 100644
index 0000000000..ef2743f6bd
Binary files /dev/null and b/blog/img/media/16720400637574/16720401472681.jpg
differ
diff --git a/blog/img/media/16720400637574/16720402083983.jpg
b/blog/img/media/16720400637574/16720402083983.jpg
new file mode 100644
index 0000000000..2a6855fdb0
Binary files /dev/null and b/blog/img/media/16720400637574/16720402083983.jpg
differ
diff --git a/blog/img/media/16720400637574/16720402291980.jpg
b/blog/img/media/16720400637574/16720402291980.jpg
new file mode 100644
index 0000000000..e6d516381c
Binary files /dev/null and b/blog/img/media/16720400637574/16720402291980.jpg
differ
diff --git a/blog/img/media/16720400637574/16720402508893.jpg
b/blog/img/media/16720400637574/16720402508893.jpg
new file mode 100644
index 0000000000..66135c4c22
Binary files /dev/null and b/blog/img/media/16720400637574/16720402508893.jpg
differ
diff --git a/blog/img/media/16720400637574/16720402711565.jpg
b/blog/img/media/16720400637574/16720402711565.jpg
new file mode 100644
index 0000000000..75d2bcfa18
Binary files /dev/null and b/blog/img/media/16720400637574/16720402711565.jpg
differ
diff --git a/blog/img/media/16720400637574/16720402758234.jpg
b/blog/img/media/16720400637574/16720402758234.jpg
new file mode 100644
index 0000000000..4f2b486f97
Binary files /dev/null and b/blog/img/media/16720400637574/16720402758234.jpg
differ
diff --git a/blog/img/media/16720400637574/16720403297820.jpg
b/blog/img/media/16720400637574/16720403297820.jpg
new file mode 100644
index 0000000000..d2ddbb7512
Binary files /dev/null and b/blog/img/media/16720400637574/16720403297820.jpg
differ
diff --git a/blog/img/media/16720400637574/16720403572773.jpg
b/blog/img/media/16720400637574/16720403572773.jpg
new file mode 100644
index 0000000000..369e1c9dc2
Binary files /dev/null and b/blog/img/media/16720400637574/16720403572773.jpg
differ
diff --git a/blog/img/media/16720400637574/16720403977529.jpg
b/blog/img/media/16720400637574/16720403977529.jpg
new file mode 100644
index 0000000000..3c546a6219
Binary files /dev/null and b/blog/img/media/16720400637574/16720403977529.jpg
differ
diff --git a/blog/img/media/16720400637574/16720404142720.jpg
b/blog/img/media/16720400637574/16720404142720.jpg
new file mode 100644
index 0000000000..22673aa79a
Binary files /dev/null and b/blog/img/media/16720400637574/16720404142720.jpg
differ
diff --git a/blog/img/media/16720400637574/16720404412259.jpg
b/blog/img/media/16720400637574/16720404412259.jpg
new file mode 100644
index 0000000000..2c4d8d2eaa
Binary files /dev/null and b/blog/img/media/16720400637574/16720404412259.jpg
differ
diff --git a/blog/img/media/16720405454837/16720405586499.jpg
b/blog/img/media/16720405454837/16720405586499.jpg
new file mode 100644
index 0000000000..dff47f339b
Binary files /dev/null and b/blog/img/media/16720405454837/16720405586499.jpg
differ
diff --git a/blog/img/media/16720405454837/16720407528096.jpg
b/blog/img/media/16720405454837/16720407528096.jpg
new file mode 100644
index 0000000000..cf4a1c2ae5
Binary files /dev/null and b/blog/img/media/16720405454837/16720407528096.jpg
differ
diff --git a/blog/img/media/16720405454837/16720407653742.jpg
b/blog/img/media/16720405454837/16720407653742.jpg
new file mode 100644
index 0000000000..7483499fd4
Binary files /dev/null and b/blog/img/media/16720405454837/16720407653742.jpg
differ
diff --git a/blog/img/media/16720405454837/16720408372893.jpg
b/blog/img/media/16720405454837/16720408372893.jpg
new file mode 100644
index 0000000000..e4a4dc49e7
Binary files /dev/null and b/blog/img/media/16720405454837/16720408372893.jpg
differ
diff --git a/blog/img/media/16720405454837/16720408471707.jpg
b/blog/img/media/16720405454837/16720408471707.jpg
new file mode 100644
index 0000000000..435a69bc29
Binary files /dev/null and b/blog/img/media/16720405454837/16720408471707.jpg
differ
diff --git a/blog/img/media/16720405454837/16720408537181.jpg
b/blog/img/media/16720405454837/16720408537181.jpg
new file mode 100644
index 0000000000..22c055187c
Binary files /dev/null and b/blog/img/media/16720405454837/16720408537181.jpg
differ
diff --git a/blog/img/media/16720405454837/16720408664980.jpg
b/blog/img/media/16720405454837/16720408664980.jpg
new file mode 100644
index 0000000000..d2a3e087c4
Binary files /dev/null and b/blog/img/media/16720405454837/16720408664980.jpg
differ
diff --git a/blog/img/media/16720405454837/16720408742949.jpg
b/blog/img/media/16720405454837/16720408742949.jpg
new file mode 100644
index 0000000000..8c403a6582
Binary files /dev/null and b/blog/img/media/16720405454837/16720408742949.jpg
differ
diff --git a/blog/img/media/16720405454837/16720408868765.jpg
b/blog/img/media/16720405454837/16720408868765.jpg
new file mode 100644
index 0000000000..d38346a64f
Binary files /dev/null and b/blog/img/media/16720405454837/16720408868765.jpg
differ
diff --git a/blog/img/media/16720405454837/16720408963992.jpg
b/blog/img/media/16720405454837/16720408963992.jpg
new file mode 100644
index 0000000000..538f2f239e
Binary files /dev/null and b/blog/img/media/16720405454837/16720408963992.jpg
differ
diff --git a/blog/img/media/16720405454837/16720409057879.jpg
b/blog/img/media/16720405454837/16720409057879.jpg
new file mode 100644
index 0000000000..df8deba7a0
Binary files /dev/null and b/blog/img/media/16720405454837/16720409057879.jpg
differ
diff --git a/blog/img/media/16720405454837/16720409115839.jpg
b/blog/img/media/16720405454837/16720409115839.jpg
new file mode 100644
index 0000000000..b6899fc11b
Binary files /dev/null and b/blog/img/media/16720405454837/16720409115839.jpg
differ
diff --git a/blog/img/media/16720405454837/16720409204499.jpg
b/blog/img/media/16720405454837/16720409204499.jpg
new file mode 100644
index 0000000000..9e5e2d180a
Binary files /dev/null and b/blog/img/media/16720405454837/16720409204499.jpg
differ
diff --git a/blog/img/media/16720405454837/16720409274430.jpg
b/blog/img/media/16720405454837/16720409274430.jpg
new file mode 100644
index 0000000000..6cabb7a729
Binary files /dev/null and b/blog/img/media/16720405454837/16720409274430.jpg
differ
diff --git a/config/blog/en-us/release.json b/config/blog/en-us/release.json
index cca5b78b9d..2fb3a43560 100644
--- a/config/blog/en-us/release.json
+++ b/config/blog/en-us/release.json
@@ -1,5 +1,11 @@
{
+ "Apache_dolphinScheduler_3.1.2": {
+ "title": "Apache DolphinScheduler releases version 3.1.2 with Python API
optimizations",
+ "author": "Leonard Nie",
+ "dateStr": "2022-12-24",
+ "desc": "Recently, Apache DolphinScheduler released version 3.1.2... "
+ },
"Apache_dolphinScheduler_3.0.3": {
"title": "DolphinScheduler released version 3.0.3, focusing on fixing 6
bugs",
"author": "Leonard Nie",
diff --git a/config/blog/en-us/tech.json b/config/blog/en-us/tech.json
index 11fa91973a..45412d474e 100644
--- a/config/blog/en-us/tech.json
+++ b/config/blog/en-us/tech.json
@@ -1,4 +1,5 @@
{
+
"DolphinScheduler_python_api_ci_cd": {
"title": "DolphinScheduler Python API CI/CD",
"author": "Leonard Nie",
@@ -11,6 +12,12 @@
"dateStr": "2022-12-10",
"desc": "Apache DolphinScheduler has officially launched on the AWS EC2
AMI application marketplace... "
},
+ "Quick_Start_with_Apache_DolphinScheduler_Machine_Learning_Workflow": {
+ "title": "Quick Start with Apache DolphinScheduler Machine Learning
Workflow",
+ "author": "Leonard Nie",
+ "dateStr": "2022-12-5",
+ "desc": "With the release of Apache DolphinScheduler 3.1.0, many AI
components... "
+ },
"How_can_more_people_benefit_from_big_data": {
"title": "How can more people benefit from big data?",
"author": "Leonard Nie",
diff --git a/config/blog/en-us/user.json b/config/blog/en-us/user.json
index ea0e7d697f..e902f0fd52 100644
--- a/config/blog/en-us/user.json
+++ b/config/blog/en-us/user.json
@@ -1,4 +1,11 @@
-{
+{
"Application_transformation_of_the_FinTech_data_center_based_on_DolphinScheduler":
{
+ "title": "Application transformation of the FinTech data center based on
DolphinScheduler",
+ "author": "Leonard Nie",
+ "dateStr": "2022-12-6",
+ "desc": "On Apache DolphinScheduler Meetup last week, ... ",
+ "img": "/img/media/16720400637574/16720400704016.jpg",
+ "logo": ""
+},
"How_did_Yili_explore_a_path_for_digital_transformation_based_on_DolphinScheduler":
{
"title": "How did Yili explore a “path” for digital transformation based
on DolphinScheduler?",
"author": "Debra Chen",