[dolphinscheduler-website] branch master updated: ADD Blog (#871)

wanggenhua Mon, 26 Dec 2022 01:23:03 -0800

This is an automated email from the ASF dual-hosted git repository.

wanggenhua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/dolphinscheduler-website.git



The following commit(s) were added to refs/heads/master by this push:
     new b021424931 ADD Blog (#871)
b021424931 is described below

commit b0214249314e125cb2a1ccca934403def149fe6c
Author: lifeng <[email protected]>
AuthorDate: Mon Dec 26 17:22:45 2022 +0800

    ADD Blog (#871)
    
    * ADD Blog
    
    * updata
    
    * updata
---
 blog/en-us/Apache_dolphinScheduler_3.1.2.md        |  68 ++++++
 ...inTech_data_center_based_on_DolphinScheduler.md | 176 +++++++++++++++
 ...e_DolphinScheduler_Machine_Learning_Workflow.md | 236 +++++++++++++++++++++
 blog/img/media/16720397220045/16720397367629.jpg   | Bin 0 -> 59519 bytes
 blog/img/media/16720400637574/16720400704016.jpg   | Bin 0 -> 83706 bytes
 blog/img/media/16720400637574/16720400759248.jpg   | Bin 0 -> 58667 bytes
 blog/img/media/16720400637574/16720401185208.jpg   | Bin 0 -> 46943 bytes
 blog/img/media/16720400637574/16720401253440.jpg   | Bin 0 -> 126403 bytes
 blog/img/media/16720400637574/16720401472681.jpg   | Bin 0 -> 67800 bytes
 blog/img/media/16720400637574/16720402083983.jpg   | Bin 0 -> 33154 bytes
 blog/img/media/16720400637574/16720402291980.jpg   | Bin 0 -> 63322 bytes
 blog/img/media/16720400637574/16720402508893.jpg   | Bin 0 -> 65984 bytes
 blog/img/media/16720400637574/16720402711565.jpg   | Bin 0 -> 27768 bytes
 blog/img/media/16720400637574/16720402758234.jpg   | Bin 0 -> 50365 bytes
 blog/img/media/16720400637574/16720403297820.jpg   | Bin 0 -> 61933 bytes
 blog/img/media/16720400637574/16720403572773.jpg   | Bin 0 -> 25883 bytes
 blog/img/media/16720400637574/16720403977529.jpg   | Bin 0 -> 47344 bytes
 blog/img/media/16720400637574/16720404142720.jpg   | Bin 0 -> 34341 bytes
 blog/img/media/16720400637574/16720404412259.jpg   | Bin 0 -> 33141 bytes
 blog/img/media/16720405454837/16720405586499.jpg   | Bin 0 -> 155526 bytes
 blog/img/media/16720405454837/16720407528096.jpg   | Bin 0 -> 53369 bytes
 blog/img/media/16720405454837/16720407653742.jpg   | Bin 0 -> 33004 bytes
 blog/img/media/16720405454837/16720408372893.jpg   | Bin 0 -> 32516 bytes
 blog/img/media/16720405454837/16720408471707.jpg   | Bin 0 -> 40498 bytes
 blog/img/media/16720405454837/16720408537181.jpg   | Bin 0 -> 25097 bytes
 blog/img/media/16720405454837/16720408664980.jpg   | Bin 0 -> 31613 bytes
 blog/img/media/16720405454837/16720408742949.jpg   | Bin 0 -> 75740 bytes
 blog/img/media/16720405454837/16720408868765.jpg   | Bin 0 -> 36381 bytes
 blog/img/media/16720405454837/16720408963992.jpg   | Bin 0 -> 60099 bytes
 blog/img/media/16720405454837/16720409057879.jpg   | Bin 0 -> 42155 bytes
 blog/img/media/16720405454837/16720409115839.jpg   | Bin 0 -> 30300 bytes
 blog/img/media/16720405454837/16720409204499.jpg   | Bin 0 -> 14531 bytes
 blog/img/media/16720405454837/16720409274430.jpg   | Bin 0 -> 25945 bytes
 config/blog/en-us/release.json                     |   6 +
 config/blog/en-us/tech.json                        |   7 +
 config/blog/en-us/user.json                        |   9 +-
 36 files changed, 501 insertions(+), 1 deletion(-)

diff --git a/blog/en-us/Apache_dolphinScheduler_3.1.2.md 
b/blog/en-us/Apache_dolphinScheduler_3.1.2.md
new file mode 100644
index 0000000000..19a69f25ea
--- /dev/null
+++ b/blog/en-us/Apache_dolphinScheduler_3.1.2.md
@@ -0,0 +1,68 @@
+---
+title: Apache DolphinScheduler releases version 3.1.2 with Python API 
optimizations
+keywords: Apache,DolphinScheduler,scheduler,big 
data,ETL,airflow,hadoop,orchestration,dataops,Kubernetes
+description: Recently, Apache DolphinScheduler released version 3.1.2.
+---
+# Apache DolphinScheduler releases version 3.1.2 with Python API optimizations
+![](/img/media/16720397220045/16720397367629.jpg)
+Recently, Apache DolphinScheduler released version 3.1.2. This version is mainly based on version 3.1.1, with 6 Python API optimizations, 19 bug fixes, and 4 document updates.
+
+## Important bug fixes:
+
+* Worker kill process does not take effect (#12995)
+* Complement dependency mode generates wrong workflow instance (#13009)
+* Python task parameter passing error (#12961)
+* Fix dependency task null pointer (#12965)
+* Task retry error (#12903)
+* Shell task calls dolphinscheduler_env.sh configuration file exception 
(#12909)
+* Corrected documentation for multiple Hive SQL runs (#12765)
+* Added token authentication for Python API (#12893)
+
+## Change Log
+
+### Bug fix
+* [Improvement] change alert start.sh (#13100)
+* [Fix] Add token as authentication for python gateway (#12893)
+* [Fix-13010] [Task] The Flink SQL task page selects the pre-job deployment 
mode, but the task executed by the worker is the Flink local mode
+* [Fix-12997][API] Fix that the end time is not reset when the workflow 
instance reruns. (#12998)
+* [Fix-12994] [Worker] Fix kill process does not take effect (#12995)
+* Fix sql task will send alert if we don’t choose the send email #12984
+* [Fix-13008] [UI] When using the complement function, turn on the dependent 
mode to generate multiple unrelated workflow instances (#13009)
+* [Fix][doc] python api release link
+* [Fix] Python task can not pass the parameters to downstream task. (#12961)
+* [Fix] Fix Java path in Kubernetes Helm Chart (#12987)
+* [Fix-12963] [Master] Fix dependent task node null pointer exception (#12965)
+* [Fix-12954] [Schedule] Fix that workflow-level configuration information 
does not take effect when timing triggers execution
+* Fix execute shell task exception no dolphinscheduler_env.sh file execute 
permission (#12909)
+* Upgrade clickhouse jdbc driver #12639
+* add spring-context to alert api (#12892)
+* [Upgrade][SQL]Modify the table t_ds_worker_group to add a description field 
in the postgresql upgrade script #12883
+* Fix NPE while retry task (#12903)
+* [Fix-12832][API] Fix update worker group exception group name already exists. #12874
+* Fix and enhance helm db config (#12707)
+
+### Document
+* [Fix][Doc] Fix sql-hive and hive-cli doc (#12765)
+* [Fix][Alert] Ignore alert not write info to db (#12867)
+* [Doc] Add skip spotless check during ASF release #12835
+* [Doc][Bug] Fix dead link caused by markdown cross-files anchor #12357 
(#12877)
+
+### Python API
+* [Fix] python API upload resource center failed
+* [Feature] Add CURD to the project/tenant/user section of the python-DS 
(#11162)
+* [Chore][Python] Change name from process definition to workflow (#12918)
+* [Feature] Support set execute type to pydolphinscheduler (#12871)
+* [Hotfix] Correct python doc link
+* [Improvement][Python] Validate version of Python API at launch (#11626)
+
+## Acknowledgment
+
+Thanks to all community contributors who participated in the release of Apache 
DolphinScheduler 3.1.2. Below is the list of the contributors by GitHub ID, in 
no particular order.
+
+
+
+| liqingwang   | liqingwang    | hezean       |
+|--------------|-------------|--------------|
+| ruanwenjun | simsicon | jieguangzhou |
+| Tianqi-Dotes  | zhuangchong | zhongjiajie |
+
diff --git 
a/blog/en-us/Application_transformation_of_the_FinTech_data_center_based_on_DolphinScheduler.md
 
b/blog/en-us/Application_transformation_of_the_FinTech_data_center_based_on_DolphinScheduler.md
new file mode 100644
index 0000000000..0d6e2b5703
--- /dev/null
+++ 
b/blog/en-us/Application_transformation_of_the_FinTech_data_center_based_on_DolphinScheduler.md
@@ -0,0 +1,176 @@
+---
+title: Application transformation of the FinTech data center based on DolphinScheduler
+keywords: Apache,DolphinScheduler,scheduler,big 
data,ETL,airflow,hadoop,orchestration,dataops,Kubernetes
+description: On Apache DolphinScheduler Meetup last week, Feng Mingxia,
+---
+# Application transformation of the FinTech data center based on 
DolphinScheduler
+![](/img/media/16720400637574/16720400704016.jpg)
+At the Apache DolphinScheduler Meetup last week, Feng Mingxia, a big data engineer from Chengfang FinTech, shared the application practice of DolphinScheduler in the field of FinTech. The following is the presentation.
+
+![](/img/media/16720400637574/16720400759248.jpg)
+Feng Mingxia, Chengfang Financial Technology Big Data Engineer
+
+He focuses on real-time and offline data processing and analysis in the field of big data, and is currently responsible for the research and development of data middle platforms.
+
+Speech summary:
+
+· Use background
+
+· Secondary transformation based on DolphinScheduler
+
+· DolphinScheduler plug-in expansion
+
+· Future and outlook
+
+## Use Background
+
+### Data Center Construction
+
+At present, big data technology is widely used in the financial field, and the big data platform has become financial infrastructure. In the construction of a big data platform, the data center is the centerpiece: it is the entrance and interface through which business systems use big data. When various business systems are connected to the data center, the data middle office needs to provide unified management and unified access to ensure the security, reliability, efficiency, and reli [...]
+
+As shown in the figure below, the data middle office sits between the business systems and the big data platform: each business system accesses the big data platform through the services provided by the data center.
+
+![](/img/media/16720400637574/16720401185208.jpg)
+The core concept of the data middle office is to realize four transformations: turning business into data, data into assets, assets into services, and services into business. This forms a complete closed loop from business to data and back to business, supporting the digital transformation of enterprises.
+
+![](/img/media/16720400637574/16720401253440.jpg)
+The logical architecture of the data center is shown in the figure above. Reading from bottom to top, the bottom layer is the data resource layer, which holds the original data generated by various business systems. The next layer is data integration, which includes offline collection and real-time collection, using technologies such as Flume and CDC real-time collection.
+
+The next layer is the data lake: data enters the lake through data integration and is stored in Hadoop distributed storage or an MPP architecture database.
+
+The next layer is the data engine layer, which processes and analyzes the data in the data lake through real-time and offline computing engines like Flink and Spark, producing service data that is available to the upper layer.
+
+The next layer is the data service that the data center needs to provide. At 
present, the data service includes data development service and data sharing 
service, providing data development and sharing capabilities for the upper 
business systems.
+
+The data application layer is the specific application of data, including data 
anomaly detection, data governance, AI decision-making, and BI analysis.
+
+In the construction of the whole data middle platform, the scheduling engine occupies the core position in the data engine layer and is also an important function of the data middle platform.
+
+### Problems and challenges faced by the data center
+The data middle office will face some problems and challenges.
+
+First of all, the execution and scheduling of data tasks are the core and key 
of data development services provided by the data center.
+
+Secondly, the data center provides unified data service management, service 
development, service invocation, and service monitoring.
+
+Third, ensuring the security of financial data is the primary task of FinTech, 
and the data middle office needs to ensure the security and reliability of data 
services.
+
+Under the above problems and challenges, we investigated some open-source 
scheduling engines.
+
+![](/img/media/16720400637574/16720401472681.jpg)
+
+At present, we use a variety of scheduling engines in production, such as Oozie, XXL-JOB, and DolphinScheduler. We introduced DolphinScheduler in 2022 after research and analysis, and it plays a very important role in the construction of the entire data center.
+
+First of all, DolphinScheduler partially addresses our requirements for 
unified service management, service development, service invocation, and 
service management.
+
+Secondly, it has its own unique design in task fault tolerance, supporting HA, 
elastic expansion, fault tolerance, and basically ensuring the safe operation 
of tasks.
+
+Third, it supports task and node monitoring.
+
+Fourth, it supports multi-tenant and permission control.
+
+Finally, its community is very active, with rapid version change and problem 
repair.
+
+Through the analysis of DolphinScheduler’s architecture and source code, we believe that its architecture conforms to mainstream big data framework design and shares architectural patterns with excellent foreign products such as HBase and Kafka.
+
+### Re-development based on DolphinScheduler
+
+To make DolphinScheduler more suitable for our application scenarios, we carried out secondary development on top of DolphinScheduler, covering 6 aspects.
+
+* Add asynchronous service call function
+* Add Oracle adaptation for the metadata database
+* Add multi-environment configuration capability
+* Add log and historical data-cleaning strategy
+* Add access to Yarn logs
+* Add service security strategy
+
+### Add asynchronous service calling function
+
+First, the asynchronous service invocation function is added. The accompanying figure shows the architecture of DolphinScheduler version 2.0.5; most of the components are service components of the native DolphinScheduler. The GateWay marked in red is a gateway service added on top of DolphinScheduler. It implements flow control and blacklist/whitelist checks, and is also the entry point through which users access service development. By optimizing the startup interface of the process and returning the unique code of the process [...]
+
+![](/img/media/16720400637574/16720402083983.jpg)
+In the classic DolphinScheduler access mode, the workflow execution instructions submitted by users enter the command table in the metadata database. After getting the zk lock, the master component obtains commands from the metadata database, performs DAG parsing, generates actual process instances, delivers the decomposed tasks to the worker nodes for execution through RPC, and then synchronously waits for the execution results.
+
+In a native DolphinScheduler request, no return code for the workflow execution is returned after the user submits the instruction. Therefore, we added a unique return ID, through which users can query the subsequent process status, download logs, and download data.
+
+### Add Oracle adaptation for the metadata database
+Our second transformation is to adapt DolphinScheduler to the Oracle database. At present, the metadata database of the native DolphinScheduler is MySQL, and our production needs require converting it to an Oracle database. To achieve this, it is necessary to complete the adaptation of the data initialization module and the data operation module.
+
+![](/img/media/16720400637574/16720402291980.jpg)
+
+First, for the data initialization module, we modified the install_config.conf configuration file to use the Oracle configuration.
+
+Secondly, an Oracle application.yml needs to be added: we added the Oracle application.yml to the apache-dolphinscheduler-2.0.*-bin/conf/ directory.
+
+Finally, we converted the data operation module by modifying the mapper files. The dolphinscheduler-dao module is the database operation module, and other modules reference it to implement database operations. It uses MyBatis for database access, so the mapper files need to be changed; all mapper files are in the resources directory.
+
+### Multi-environment configuration capability
+The native DolphinScheduler installation cannot be configured per environment; generally, relevant parameters need to be adjusted manually according to the actual environment. We wanted the installation script to support environment selection and configuration, reducing the cost of manual online modification and automating installation. We believe many readers have encountered similar difficulties. In order to use DolphinScheduler in a development envir [...]
+
+We modified the install.sh file to accept the input parameter [dev|test|product] and select the matching install_config_${evn}.conf, so the installation automatically selects the environment.
+
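The environment switch described above could look roughly like the following shell sketch. It is not the speaker's actual install.sh: the function name `select_install_config` and the `conf_dir` argument are assumptions made for illustration; only the `[dev|test|product]` parameter and the `install_config_<env>.conf` naming come from the text.

```shell
# Hypothetical helper (not the speaker's install.sh): map an environment
# name to the install config file that should be loaded.
select_install_config() {
    env_name="${1:-}"
    conf_dir="${2:-.}"
    case "$env_name" in
        dev|test|product)
            # Naming follows install_config_<env>.conf from the text.
            printf '%s/install_config_%s.conf\n' "$conf_dir" "$env_name"
            ;;
        *)
            echo "usage: install.sh [dev|test|product]" >&2
            return 1
            ;;
    esac
}

# Example: resolve the config for the test environment.
select_install_config test ./conf   # prints ./conf/install_config_test.conf
```

Rejecting unknown names up front keeps a typo from silently installing with the wrong environment's settings.
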
+In addition, DolphinScheduler’s workflow is strongly bound to the environment, and workflows in different environments cannot be shared. The following figure shows the JSON file of a workflow exported by the native DolphinScheduler. The grayed part represents the resources on which the process depends. The ID is a number generated by the database’s auto-increment. However, if the process instances generated by environment a are placed in environment b, there may b [...]
+
+![](/img/media/16720400637574/16720402508893.jpg)
+We solve this problem by generating the absolute path of the resource as the 
unique ID of the resource.
+
+### Log and historical data cleaning policy
+
+The DolphinScheduler generates a lot of data. The database will generate 
instance data in the instance table, which will continue to grow with the 
running of instance tasks. Our strategy is to clean up the data of these tables 
according to the agreed save cycle by defining the scheduled task of 
DolphinScheduler.
+
+Secondly, the file data of DolphinScheduler mainly includes log data and task execution directories, that is, the service logs of the worker, master, and API, and the directories in which the worker executes tasks. This data is not automatically deleted at the end of task execution and also needs to be deleted through scheduled tasks. By running the log cleanup script, we can automatically delete logs.
+
+![](/img/media/16720400637574/16720402711565.jpg)
+![](/img/media/16720400637574/16720402758234.jpg)
+
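A retention-based file cleanup of the kind described above can be sketched in shell as follows. This is only an illustration under assumptions: the function name `clean_old_logs`, its arguments, and the idea of keeping files for a fixed number of days are hypothetical, and the sketch covers log files and execution directories on disk, not the database instance tables.

```shell
# Hypothetical cleanup helper: delete regular files under a directory whose
# modification time is more than <keep_days> days old.
clean_old_logs() {
    log_dir="${1:-}"
    keep_days="${2:-}"
    # Reject a missing or non-numeric retention value before touching files.
    case "$keep_days" in
        ''|*[!0-9]*)
            echo "usage: clean_old_logs <dir> <keep_days>" >&2
            return 2
            ;;
    esac
    # Nothing to do when the directory does not exist.
    [ -d "$log_dir" ] || return 0
    find "$log_dir" -type f -mtime "+$keep_days" -exec rm -f {} +
}
```

A call such as `clean_old_logs /path/to/worker/logs 30` could then be placed in a script that cron runs once a day; the path and the 30-day period here are placeholders, not values from the talk.
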
+
+###  Increased access to Yarn logs
+
+The native DolphinScheduler can obtain the log information executed on the worker node, but for tasks on Yarn, you need to log in to the Yarn cluster and obtain the logs through a command or interface. We obtain the Yarn task ID by parsing the YARNID tag in the log and then fetch the task log through the YarnClient. This reduces the manual work of viewing logs.
+
+![](/img/media/16720400637574/16720403297820.jpg)
+
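The idea of pulling the Yarn ID out of the worker log and then fetching the Yarn log can be approximated from the command line. The sketch below is an assumption-laden stand-in: it uses the `yarn logs -applicationId` CLI rather than the YarnClient API mentioned in the talk, it assumes IDs appear in the standard `application_<timestamp>_<sequence>` form, and the function names are hypothetical.

```shell
# Print the unique YARN application IDs mentioned in a task log file.
# Assumes the standard application_<clusterTimestamp>_<sequence> format.
extract_yarn_app_ids() {
    grep -Eo 'application_[0-9]+_[0-9]+' "$1" | sort -u
}

# Fetch the aggregated YARN logs for every application found in the task log.
# Requires the Hadoop "yarn" command on PATH and log aggregation enabled.
fetch_yarn_logs() {
    for app_id in $(extract_yarn_app_ids "$1"); do
        yarn logs -applicationId "$app_id"
    done
}
```

Deduplicating with `sort -u` matters because a task log usually mentions the same application ID many times while the job is being tracked.
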
+
+### Service security policy
+
+* Add Monitor component monitoring
+
+![](/img/media/16720400637574/16720403572773.jpg)
+
+
+The above figure shows the interaction between the master and worker, the two 
core components of DolphinScheduler, and Zookeeper. When the MasterServer 
service starts, it will register a temporary node with Zookeeper, and conduct 
fault tolerance processing by listening for changes in Zookeeper temporary 
nodes. WorkerServer is mainly responsible for task execution. When the 
WorkerServer service starts, it registers a temporary node with Zookeeper and 
maintains the heartbeat. At present, Z [...]
+
+The relevant parameters can be seen when the master and worker connect to 
Zookeeper, including connection timeout, session timeout, and a maximum number 
of retries.
+
+Due to network jitter and other factors, master and worker nodes may lose connection with zk. After the connection is lost, the temporary information registered on zk by the worker and master disappears, so the master and worker are judged to be lost, which affects task execution. Without human intervention, the task will be delayed. We added the monitor component to monitor the service status. Through the scheduled task cron, we run the monitor program [...]
+
+* Add Kerberos authentication link for service components using zk
+
+The second security policy is to add the Kerberos authentication link for 
service components using zk. Kerberos is a network authentication protocol 
designed to provide powerful authentication services for client/server 
applications through a key system. Master service components, API service 
components, and worker service components complete Kerberos authentication at 
startup, and then use zk for relevant service registration and heartbeat 
connection to ensure service security.
+
+### DolphinScheduler-based plugin extension
+In addition, we have extended the plug-in based on DolphinScheduler. We have 
extended four types of operators, including Richshell, SparkSQL, Dataexport, 
and GBase operators.
+
+### Add a new task type Richshell
+First of all, Richshell, a new task type, has enhanced the native Shell 
function. It mainly realizes the dynamic replacement of script parameters 
through the template engine. Users can replace script parameters through 
service calls, making users more flexible in using parameters. It is a 
supplement to global parameters.
+
+![](/img/media/16720400637574/16720403977529.jpg)
+
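Dynamic replacement of script parameters can be illustrated with a small shell sketch. The talk does not say which template engine or placeholder syntax Richshell uses, so everything here is an assumption: placeholders are written as `${name}`, the function `render_template` is hypothetical, and `sed` stands in for a real template engine.

```shell
# render_template <template-file> name=value ...
# Replaces every ${name} placeholder in the template with the given value.
# Limitation of this sketch: values must not contain "|", "&" or backslashes.
render_template() {
    template="$1"
    shift
    content=$(cat "$template") || return 1
    for pair in "$@"; do
        case "$pair" in
            *=*) ;;
            *) echo "expected name=value, got: $pair" >&2; return 2 ;;
        esac
        name=${pair%%=*}
        value=${pair#*=}
        # Only plain identifiers are accepted as placeholder names.
        case "$name" in
            ''|[0-9]*|*[!A-Za-z0-9_]*)
                echo "invalid parameter name: $name" >&2
                return 2
                ;;
        esac
        content=$(printf '%s\n' "$content" | sed "s|\${$name}|$value|g")
    done
    printf '%s\n' "$content"
}
```

Placeholders that receive no value are left untouched, which makes a missing parameter easy to spot in the rendered script.
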
+
+### Add a new task type SparkSQL
+
+The second operator added is SparkSQL. Users can execute Spark tasks by writing SQL so that tasks can be scheduled on Yarn. DolphinScheduler natively also supports SparkSQL execution in JDBC mode, but resource contention can occur because the number of JDBC connections is limited. The Yarn cluster mode cannot be used for execution through tools such as spark-sql or Spark Beeline. By using this task type, SparkSQL programs can be run on the Yarn cluster in cluster mode to maximize [...]
+
+### Add a new task type Dataexport
+
+The third addition is Dataexport, a data export operator. Users can export data stored in different storage components by selecting the component, including ES, Hive, HBase, etc.
+
+![](/img/media/16720400637574/16720404142720.jpg)
+The data in the big data platform may be used for BI display, statistical 
analysis, machine learning, and other data preparation after being exported. 
Most of these scenarios require data export, and Spark’s data processing 
capability is used to achieve the export function of different data sources.
+
+### Add a new task type GBase
+The fourth plug-in added is GBase. GBase 8a MPP Cluster is a distributed parallel database cluster with column storage and a shared-nothing architecture. It offers high performance, high availability, and high scalability. It is suitable for OLAP scenarios (query scenarios), can provide a cost-effective general computing platform for large-scale data management, and is widely used to support various data warehouse systems, BI systems, and decision support systems.
+
+![](/img/media/16720400637574/16720404412259.jpg)
+As an application scenario of data entering the lake, we have added a GBase 
operator, which supports the import, export, and execution of GBase data.
+
diff --git 
a/blog/en-us/Quick_Start_with_Apache_DolphinScheduler_Machine_Learning_Workflow.md
 
b/blog/en-us/Quick_Start_with_Apache_DolphinScheduler_Machine_Learning_Workflow.md
new file mode 100644
index 0000000000..bc78b98f16
--- /dev/null
+++ 
b/blog/en-us/Quick_Start_with_Apache_DolphinScheduler_Machine_Learning_Workflow.md
@@ -0,0 +1,236 @@
+---
+title: Quick Start with Apache DolphinScheduler Machine Learning Workflow
+keywords: Apache,DolphinScheduler,scheduler,big 
data,ETL,airflow,hadoop,orchestration,dataops,Kubernetes,Conda
+description: With the release of Apache DolphinScheduler 3.1.0, many AI 
components
+---
+# Quick Start with Apache DolphinScheduler Machine Learning Workflow
+![](/img/media/16720405454837/16720405586499.jpg)
+## Abstract
+With the release of Apache DolphinScheduler 3.1.0, many AI components have 
been added to help users to build machine learning workflows on Apache 
DolphinScheduler more efficiently.
+
+This article describes in detail how to set up DolphinScheduler with some 
Machine Learning environments. It also introduces the use of the MLflow 
component and the DVC component with experimental examples.
+
+## DolphinScheduler and Machine Learning Environment
+Test Program
+All code can be found at 
https://github.com/jieguangzhou/dolphinscheduler-ml-tutorial
+
+Get the code
+
+```
+git clone https://github.com/jieguangzhou/dolphinscheduler-ml-tutorial.git
+cd dolphinscheduler-ml-tutorial
+git checkout dev
+```
+### Installation environment
+**Conda**
+Simply install it following the official website and add the Conda path to the environment variables.
+
+Then install mlflow and dvc with pip; afterwards the mlflow and dvc commands will be available in Conda’s bin directory.
+```
+pip install mlflow==1.30.0 dvc
+```
+
+**Java8 environment**
+
+```
+sudo apt-get update
+sudo apt-get install openjdk-8-jdk
+java -version
+```
+Configure the Java environment variables in ~/.bashrc or ~/.zshrc
+
+```
+# Confirm that your jdk is as below and configure the environment variables
+export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
+export PATH=$PATH:$JAVA_HOME/bin
+```
+
+**Apache DolphinScheduler 3.1.0**
+
+Download DolphinScheduler 3.1.0
+```
+# Go to the following directory (you can install elsewhere; this directory is used here so the steps are easy to reproduce)
+cd first-example/install_dolphinscheduler
+## install DolphinScheduler
+wget https://dlcdn.apache.org/dolphinscheduler/3.1.0/apache-dolphinscheduler-3.1.0-bin.tar.gz
+tar -zxvf apache-dolphinscheduler-3.1.0-bin.tar.gz
+rm apache-dolphinscheduler-3.1.0-bin.tar.gz
+```
+
+Configuring the Conda environment and Python environment in DolphinScheduler
+```
+## Configure conda environment and default python environment
+cp common.properties apache-dolphinscheduler-3.1.0-bin/standalone-server/conf
+echo "export PATH=$(dirname $(which conda)):\$PATH" >> apache-dolphinscheduler-3.1.0-bin/bin/env/dolphinscheduler_env.sh
+echo "export PYTHON_HOME=$(dirname $(which conda))/python" >> 
apache-dolphinscheduler-3.1.0-bin/bin/env/dolphinscheduler_env.sh
+```
+
+* dolphinscheduler-mlflow configuration
+When using the MLFLOW component, the dolphinscheduler-mlflow project on GitHub 
will be used as a reference, so if you can’t get a proper network connection, 
you can replace the repository source by following these steps
+
+Firstly execute git clone 
<https://github.com/apache/dolphinscheduler-mlflow.git>
+
+Then change the value of the ml.mlflow.preset_repository field in 
common.properties to the default path for the download
+
+Start DolphinScheduler
+```
+## start DolphinScheduler
+cd apache-dolphinscheduler-3.1.0-bin
+bash bin/dolphinscheduler-daemon.sh start standalone-server
+## You can view the log using the following command
+# tail -500f standalone-server/logs/dolphinscheduler-standalone.log
+```
+
+Once started, wait a moment for the service to boot up, then open http://localhost:12345/dolphinscheduler/ui to see the DolphinScheduler page
+
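Instead of waiting by hand, the startup can be polled from the shell. The following is a minimal sketch, assuming `curl` is installed; the function name `wait_for_url` and the default of 30 attempts spaced 5 seconds apart are illustrative choices, not part of DolphinScheduler.

```shell
# wait_for_url <url> [max_attempts] [sleep_seconds]
# Returns 0 as soon as the URL answers with a successful HTTP status,
# or 1 if it never does within the given number of attempts.
wait_for_url() {
    url="$1"
    attempts="${2:-30}"
    delay="${3:-5}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: treat HTTP errors as failures, -s: silent, --max-time: avoid hangs
        if curl -fs --max-time 5 -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For the standalone server above, this could be called as `wait_for_url http://localhost:12345/dolphinscheduler/ui 30 5` before opening the page in a browser.
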
+Account: admin, Password: dolphinscheduler123
+![](/img/media/16720405454837/16720407528096.jpg)
+**MLflow**
+The MLflow Tracking Server is relatively simple to start up; it can be started with the command docker run --name mlflow -p 5000:5000 -d jalonzjg/mlflow:latest
+
+Open http://localhost:5000, and you will be able to find the MLflow model and experiment management page
+
+![](/img/media/16720405454837/16720407653742.jpg)
+The Dockerfile for this image can be found at first-example/docker-mlflow/Dockerfile
+
+**Components Introduction**
+There are 5 main types of components used in this article
+
+**SHELL component**
+The SHELL component is used to run shell-type tasks
+
+**PYTHON component**
+The PYTHON component is used to run python-type tasks
+
+**CONDITIONS component**
+CONDITIONS is a conditional node that determines which downstream task should 
be run based on the running status of the upstream task.
+
+**MLFLOW component**
+MLFLOW component is used to run the MLflow Project on DolphinScheduler based 
on the dolphinscheduler-mlflow library to implement pre-built algorithms and 
AutoML functionality for classification scenarios and to deploy models on the 
MLflow tracking server
+
+**DVC component**
+DVC component is used for data versioning in machine learning on 
DolphinScheduler, such as registering specific data as a specific version and 
downloading specific versions of data.
+
+Among the above five components
+
+* SHELL component and PYTHON component are the base components, which can run 
a wide range of tasks.
+* CONDITIONS are logical components that can dynamically control the logic of 
the workflow’s operation.
+* The MLFLOW component and DVC component are machine learning components that make machine learning features easier to use within the workflow.
+
+## Machine learning workflow
+The workflow consists of three parts.
+
+* The first part is the preliminary preparation, such as data download, data 
versioning management repository, etc.; it is a one-time preparation.
+* The second part is the training model workflow: it includes data 
pre-processing, training model, and model evaluation
+* The third part is the deployment workflow, which includes model deployment 
and interface testing.
+
+### Preliminary preparation workflow
+Create a directory to store all the process data: mkdir /tmp/ds-ml-example
+
+At the beginning of the program, we need to download the test data and 
initialize the DVC repository for data versioning
+
+All the following commands are run in the 
dolphinscheduler-ml-tutorial/first-example directory
+
+Since we are submitting the workflow via pydolphinscheduler, let’s install it first: pip install apache-dolphinscheduler==3.1.0
+
+Workflow(download-data): Downloading test data
+
+Command: pydolphinscheduler yaml -f pyds/download_data.yaml
+
+Execute the following two tasks in order
+
+1. Install-dependencies: install the python dependencies packages needed in 
the download script
+
+2. Download-data: download the dataset to /tmp/ds-ml-example/raw
+
+![](/img/media/16720405454837/16720408372893.jpg)
+Workflow(dvc_init_local): Initialize the dvc data versioning management 
repository
+
+Command: pydolphinscheduler yaml -f pyds/init_dvc_repo.yaml
+
+Execute the following tasks in order
+
+1. create_git_repo: Create an empty git repository in the local environment
+
+2. init_dvc: convert the repository to a dvc-type repository for data 
versioning
+
+3. condition: determine the status of the init_dvc task, if successful then 
execute report_success_message, otherwise execute report_error_message
+
+![](/img/media/16720405454837/16720408471707.jpg)
+### Training model workflow
+In the training model part, the workflow includes data pre-processing, model 
training, and model evaluation.
+
+Workflow(download-data): data preprocessing
+
+Command: pydolphinscheduler yaml -f pyds/prepare_data.yaml
+
+![](/img/media/16720405454837/16720408537181.jpg)
+Perform the following tasks in order
+
+1. data_preprocessing: preprocesses the data; for demo purposes, we only perform a simple truncation procedure here
+
+2. upload_data: uploads data to the repository and registers it as a specific 
version v1
+
+The following image shows the information in the git repository
+
+![](/img/media/16720405454837/16720408664980.jpg)
+Workflow(train_model): Training model
+
+Command: pydolphinscheduler yaml -f pyds/train_model.yaml
+
+Perform the following tasks in order
+
+1. clean_exists_data: Delete the historical data in /tmp/ds-ml-example/train_data generated by potentially repeated tests
+
+2. pull_data: pull v1 data to /tmp/ds-ml-example/train_data
+
+3. train_automl: use the AutoML function of the MLflow component to train the 
classification model and register it with the MLflow Tracking Server; if the 
current model version has the highest F1 score, register it as the Production 
version.
+
+4. inference: import a small part of the data for batch inference using the 
mlflow CLI
+
+5. evaluate: obtain the inference results and perform a simple evaluation of 
the model, including the metrics on the new data, the predicted label 
distribution, etc.
+![](/img/media/16720405454837/16720408742949.jpg)
+
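The promotion rule described for train_automl (register the new version as Production only when its F1 is the highest) can be sketched in plain Python. This is a minimal illustration and not the MLflow component's actual implementation: the `ModelVersion` record and the `promote_if_best` helper are hypothetical names, and treating a tie as "not better" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelVersion:
    """Hypothetical record of one registered model version."""
    version: int
    f1: float
    stage: str = "None"


def promote_if_best(candidate: ModelVersion, registered: List[ModelVersion]) -> bool:
    """Mark `candidate` as Production only if its F1 beats every registered version.

    Assumption: a tie does not count as better, so the existing
    Production version stays in place.
    """
    best_existing = max((m.f1 for m in registered), default=float("-inf"))
    if candidate.f1 <= best_existing:
        return False
    # Demote whichever version currently holds the Production stage.
    for m in registered:
        if m.stage == "Production":
            m.stage = "Archived"
    candidate.stage = "Production"
    return True
```

With no registered versions, the first candidate is always promoted, since any F1 beats the empty default.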
+
+The test results and the model can be viewed in the MLflow Tracking Server 
(http://localhost:5000) once train_automl has finished.
+
+![](/img/media/16720405454837/16720408868765.jpg)
+The logs of the evaluate task can be viewed once it has finished.
+
+![](/img/media/16720405454837/16720408963992.jpg)
+Deployment Process Workflow
+Workflow(deploy_model): Deploy the model
+
+Run: pydolphinscheduler yaml -f pyds/deploy.yaml
+
+Run the following tasks in order.
+
+1. kill-server: Shut down the previous server
+
+2. deploy-model: Deploy the model
+
+3. test-server: Test the server
+
+![](/img/media/16720405454837/16720409057879.jpg)
+If this workflow is started manually, the interface looks as follows; just 
enter the port number and the model version number.
+
+![](/img/media/16720405454837/16720409115839.jpg)
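The individual submissions shown so far can also be scripted. The following is a small sketch rather than part of the example repository: it assumes the `pydolphinscheduler` CLI is on the PATH and already configured for your DolphinScheduler API server, and that it runs from the directory containing `pyds/`. The helper names `build_commands` and `submit_all` are hypothetical.

```python
import shlex
import subprocess
from typing import List

# YAML definitions named in this post, in the order they are submitted.
WORKFLOW_FILES = [
    "pyds/download_data.yaml",
    "pyds/init_dvc_repo.yaml",
    "pyds/prepare_data.yaml",
    "pyds/train_model.yaml",
    "pyds/deploy.yaml",
]


def build_commands(files: List[str]) -> List[List[str]]:
    """Build one `pydolphinscheduler yaml -f <file>` command per definition."""
    return [["pydolphinscheduler", "yaml", "-f", path] for path in files]


def submit_all(files: List[str], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in build_commands(files):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing submission.
            subprocess.run(cmd, check=True)
```

Calling `submit_all(WORKFLOW_FILES)` only prints the commands; pass `dry_run=False` to actually submit them.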
+Integrate the workflows
+In practice, once the individual workflows are stable, the whole process needs 
to be linked together: for example, when a new data version arrives, train the 
model, and deploy it if it performs better.
+
+To do this, we switch to the production version: git checkout 
first-example-production
+
+The differences between the two versions are:
+
+1. there is an additional workflow definition, train_and_deploy.yaml, which 
combines the individual workflows
+
+2. the pre-processing script is modified to produce the v2 data
+
+3. the flag in the definition of each sub-workflow is changed to false, so that 
train_and_deploy.yaml runs them together.
+
+Run: pydolphinscheduler yaml -f pyds/train_and_deploy.yaml
+
+Each task in the diagram below is a sub-workflow task corresponding to one of 
the three workflows described above.
+
+![](/img/media/16720405454837/16720409204499.jpg)
+As shown below, after the run a new version of the model, version2, is 
obtained and registered as the Production version.
+
+![](/img/media/16720405454837/16720409274430.jpg)
+
diff --git a/blog/img/media/16720397220045/16720397367629.jpg 
b/blog/img/media/16720397220045/16720397367629.jpg
new file mode 100644
index 0000000000..38390fff76
Binary files /dev/null and b/blog/img/media/16720397220045/16720397367629.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720400704016.jpg 
b/blog/img/media/16720400637574/16720400704016.jpg
new file mode 100644
index 0000000000..c900e1f6f0
Binary files /dev/null and b/blog/img/media/16720400637574/16720400704016.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720400759248.jpg 
b/blog/img/media/16720400637574/16720400759248.jpg
new file mode 100644
index 0000000000..2c6dd50560
Binary files /dev/null and b/blog/img/media/16720400637574/16720400759248.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720401185208.jpg 
b/blog/img/media/16720400637574/16720401185208.jpg
new file mode 100644
index 0000000000..6f65489aa6
Binary files /dev/null and b/blog/img/media/16720400637574/16720401185208.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720401253440.jpg 
b/blog/img/media/16720400637574/16720401253440.jpg
new file mode 100644
index 0000000000..4d9db4352b
Binary files /dev/null and b/blog/img/media/16720400637574/16720401253440.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720401472681.jpg 
b/blog/img/media/16720400637574/16720401472681.jpg
new file mode 100644
index 0000000000..ef2743f6bd
Binary files /dev/null and b/blog/img/media/16720400637574/16720401472681.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720402083983.jpg 
b/blog/img/media/16720400637574/16720402083983.jpg
new file mode 100644
index 0000000000..2a6855fdb0
Binary files /dev/null and b/blog/img/media/16720400637574/16720402083983.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720402291980.jpg 
b/blog/img/media/16720400637574/16720402291980.jpg
new file mode 100644
index 0000000000..e6d516381c
Binary files /dev/null and b/blog/img/media/16720400637574/16720402291980.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720402508893.jpg 
b/blog/img/media/16720400637574/16720402508893.jpg
new file mode 100644
index 0000000000..66135c4c22
Binary files /dev/null and b/blog/img/media/16720400637574/16720402508893.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720402711565.jpg 
b/blog/img/media/16720400637574/16720402711565.jpg
new file mode 100644
index 0000000000..75d2bcfa18
Binary files /dev/null and b/blog/img/media/16720400637574/16720402711565.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720402758234.jpg 
b/blog/img/media/16720400637574/16720402758234.jpg
new file mode 100644
index 0000000000..4f2b486f97
Binary files /dev/null and b/blog/img/media/16720400637574/16720402758234.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720403297820.jpg 
b/blog/img/media/16720400637574/16720403297820.jpg
new file mode 100644
index 0000000000..d2ddbb7512
Binary files /dev/null and b/blog/img/media/16720400637574/16720403297820.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720403572773.jpg 
b/blog/img/media/16720400637574/16720403572773.jpg
new file mode 100644
index 0000000000..369e1c9dc2
Binary files /dev/null and b/blog/img/media/16720400637574/16720403572773.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720403977529.jpg 
b/blog/img/media/16720400637574/16720403977529.jpg
new file mode 100644
index 0000000000..3c546a6219
Binary files /dev/null and b/blog/img/media/16720400637574/16720403977529.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720404142720.jpg 
b/blog/img/media/16720400637574/16720404142720.jpg
new file mode 100644
index 0000000000..22673aa79a
Binary files /dev/null and b/blog/img/media/16720400637574/16720404142720.jpg 
differ
diff --git a/blog/img/media/16720400637574/16720404412259.jpg 
b/blog/img/media/16720400637574/16720404412259.jpg
new file mode 100644
index 0000000000..2c4d8d2eaa
Binary files /dev/null and b/blog/img/media/16720400637574/16720404412259.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720405586499.jpg 
b/blog/img/media/16720405454837/16720405586499.jpg
new file mode 100644
index 0000000000..dff47f339b
Binary files /dev/null and b/blog/img/media/16720405454837/16720405586499.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720407528096.jpg 
b/blog/img/media/16720405454837/16720407528096.jpg
new file mode 100644
index 0000000000..cf4a1c2ae5
Binary files /dev/null and b/blog/img/media/16720405454837/16720407528096.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720407653742.jpg 
b/blog/img/media/16720405454837/16720407653742.jpg
new file mode 100644
index 0000000000..7483499fd4
Binary files /dev/null and b/blog/img/media/16720405454837/16720407653742.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720408372893.jpg 
b/blog/img/media/16720405454837/16720408372893.jpg
new file mode 100644
index 0000000000..e4a4dc49e7
Binary files /dev/null and b/blog/img/media/16720405454837/16720408372893.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720408471707.jpg 
b/blog/img/media/16720405454837/16720408471707.jpg
new file mode 100644
index 0000000000..435a69bc29
Binary files /dev/null and b/blog/img/media/16720405454837/16720408471707.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720408537181.jpg 
b/blog/img/media/16720405454837/16720408537181.jpg
new file mode 100644
index 0000000000..22c055187c
Binary files /dev/null and b/blog/img/media/16720405454837/16720408537181.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720408664980.jpg 
b/blog/img/media/16720405454837/16720408664980.jpg
new file mode 100644
index 0000000000..d2a3e087c4
Binary files /dev/null and b/blog/img/media/16720405454837/16720408664980.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720408742949.jpg 
b/blog/img/media/16720405454837/16720408742949.jpg
new file mode 100644
index 0000000000..8c403a6582
Binary files /dev/null and b/blog/img/media/16720405454837/16720408742949.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720408868765.jpg 
b/blog/img/media/16720405454837/16720408868765.jpg
new file mode 100644
index 0000000000..d38346a64f
Binary files /dev/null and b/blog/img/media/16720405454837/16720408868765.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720408963992.jpg 
b/blog/img/media/16720405454837/16720408963992.jpg
new file mode 100644
index 0000000000..538f2f239e
Binary files /dev/null and b/blog/img/media/16720405454837/16720408963992.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720409057879.jpg 
b/blog/img/media/16720405454837/16720409057879.jpg
new file mode 100644
index 0000000000..df8deba7a0
Binary files /dev/null and b/blog/img/media/16720405454837/16720409057879.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720409115839.jpg 
b/blog/img/media/16720405454837/16720409115839.jpg
new file mode 100644
index 0000000000..b6899fc11b
Binary files /dev/null and b/blog/img/media/16720405454837/16720409115839.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720409204499.jpg 
b/blog/img/media/16720405454837/16720409204499.jpg
new file mode 100644
index 0000000000..9e5e2d180a
Binary files /dev/null and b/blog/img/media/16720405454837/16720409204499.jpg 
differ
diff --git a/blog/img/media/16720405454837/16720409274430.jpg 
b/blog/img/media/16720405454837/16720409274430.jpg
new file mode 100644
index 0000000000..6cabb7a729
Binary files /dev/null and b/blog/img/media/16720405454837/16720409274430.jpg 
differ
diff --git a/config/blog/en-us/release.json b/config/blog/en-us/release.json
index cca5b78b9d..2fb3a43560 100644
--- a/config/blog/en-us/release.json
+++ b/config/blog/en-us/release.json
@@ -1,5 +1,11 @@
 
 {
+  "Apache_dolphinScheduler_3.1.2": {
+    "title": "Apache DolphinScheduler releases version 3.1.2 with Python API 
optimizations",
+    "author": "Leonard Nie",
+    "dateStr": "2022-12-24",
+    "desc": "Recently, Apache DolphinScheduler released version 3.1.2........ "
+  },
   "Apache_dolphinScheduler_3.0.3": {
     "title": "DolphinScheduler released version 3.0.3, focusing on fixing 6 
bugs",
     "author": "Leonard Nie",
diff --git a/config/blog/en-us/tech.json b/config/blog/en-us/tech.json
index 11fa91973a..45412d474e 100644
--- a/config/blog/en-us/tech.json
+++ b/config/blog/en-us/tech.json
@@ -1,4 +1,5 @@
 {
+
   "DolphinScheduler_python_api_ci_cd": {
     "title": "DolphinScheduler Python API CI/CD",
     "author": "Leonard Nie",
@@ -11,6 +12,12 @@
     "dateStr": "2022-12-10",
     "desc": "Apache DolphinScheduler has officially launched on the AWS EC2 
AMI application marketplace... "
   },
+  "Quick_Start_with_Apache_DolphinScheduler_Machine_Learning_Workflow": {
+    "title": "Quick Start with Apache DolphinScheduler Machine Learning 
Workflow",
+    "author": "Leonard Nie",
+    "dateStr": "2022-12-5",
+    "desc": "With the release of Apache DolphinScheduler 3.1.0, many AI 
components... "
+  },
   "How_can_more_people_benefit_from_big_data": {
     "title": "How can more people benefit from big data?",
     "author": "Leonard Nie",
diff --git a/config/blog/en-us/user.json b/config/blog/en-us/user.json
index ea0e7d697f..e902f0fd52 100644
--- a/config/blog/en-us/user.json
+++ b/config/blog/en-us/user.json
@@ -1,4 +1,11 @@
-{
+{ 
"Application_transformation_of_the_FinTech_data_center_based_on_DolphinScheduler":
 {
+  "title": "Application transformation of the FinTech data center based on 
DolphinScheduler",
+  "author": "Leonard Nie",
+  "dateStr": "2022-12-6",
+  "desc": "On Apache DolphinScheduler Meetup last week, ... ",
+  "img": "/img/media/16720400637574/16720400704016.jpg",
+  "logo": ""
+},
   
"How_did_Yili_explore_a_path_for_digital_transformation_based_on_DolphinScheduler":
 {
     "title": "How did Yili explore a “path” for digital transformation based 
on DolphinScheduler?",
     "author": "Debra Chen",
