[GitHub] [beam] agvdndor commented on a diff in pull request #23094: Concept guide on orchestrating Beam preprocessing

GitBox Wed, 02 Nov 2022 03:52:50 -0700


agvdndor commented on code in PR #23094:
URL: https://github.com/apache/beam/pull/23094#discussion_r1011557727



##########
website/www/site/content/en/documentation/ml/orchestration.md:
##########
@@ -0,0 +1,227 @@
+---
+title: "Orchestration"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Workflow orchestration
+
+## Understanding the Beam DAG
+
+
+Apache Beam is an open source, unified model for defining both batch and 
streaming data-parallel processing pipelines. One of the central concepts of 
the Beam programming model is the DAG (directed acyclic graph). Each Beam 
pipeline is a DAG that you construct through the Beam SDK in your programming 
language of choice (from the set of supported Beam SDKs). Each node of this DAG 
represents a processing step (PTransform) that accepts a collection of data as 
input (PCollection) and outputs a transformed collection of data (PCollection). 
The edges define how data flows through the pipeline from one processing step 
to another. The image below shows an example of such a pipeline.  
+
+![A standalone Beam pipeline](/images/standalone-beam-pipeline.svg)
+
+Note that simply defining a pipeline and the corresponding DAG does not mean 
that data will start flowing through the pipeline. To actually execute the 
pipeline, it has to be deployed to one of the [supported Beam 
runners](https://beam.apache.org/documentation/runners/capability-matrix/). 
These distributed processing back-ends include Apache Flink, Apache Spark and 
Google Cloud Dataflow. A [Direct 
Runner](https://beam.apache.org/documentation/runners/direct/) is also provided 
to execute the pipeline locally on your machine for development and debugging 
purposes. Make sure to check out the [runner capability 
matrix](https://beam.apache.org/documentation/runners/capability-matrix/) to 
guarantee that the chosen runner supports the data processing steps defined in 
your pipeline, especially when using the Direct Runner.  
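
The separation between defining a pipeline and executing it can be sketched in 
plain Python. This is a toy illustration only, not the Beam API: `ToyPipeline`, 
`apply`, `run` and the two step functions are hypothetical names, and the 
"graph" is a simple linear chain rather than a general DAG. In Beam, the runner 
plays the role of `run()` here.

```python
from typing import Callable, Iterable, List

Step = Callable[[List[str]], List[str]]


class ToyPipeline:
    """Records processing steps; nothing executes until run() is called."""

    def __init__(self, source: Iterable[str]) -> None:
        self._source = source
        self._steps: List[Step] = []

    def apply(self, step: Step) -> "ToyPipeline":
        # Only extends the graph definition; no data is touched here.
        self._steps.append(step)
        return self

    def run(self) -> List[str]:
        # Execution happens here, one step after another.
        data = list(self._source)
        for step in self._steps:
            data = step(data)
        return data


executed: List[str] = []  # records which steps actually ran


def lowercase(items: List[str]) -> List[str]:
    executed.append("lowercase")
    return [item.lower() for item in items]


def drop_empty(items: List[str]) -> List[str]:
    executed.append("drop_empty")
    return [item for item in items if item]


pipeline = ToyPipeline(["A cat", "", "Two DOGS"]).apply(lowercase).apply(drop_empty)
steps_run_before = list(executed)  # still empty: the graph is only defined
result = pipeline.run()            # now the data flows through the steps
```

After construction `executed` is still empty; the steps are recorded in it only 
once `run()` is called.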
+
+## Orchestrating frameworks
+
+Successfully delivering machine learning projects is about a lot more than 
training a model and calling it a day. A full ML workflow often contains a 
range of other steps, including data ingestion, data validation, data 
preprocessing, model evaluation, model deployment and data drift detection. 
On top of that, it’s essential to keep track of metadata and artifacts from 
your experiments to answer important questions like: What data was this model 
trained on and with which training parameters? When was this model deployed and 
what accuracy did it achieve on a test dataset? Without this knowledge at your 
disposal, it becomes increasingly difficult to troubleshoot, monitor and 
improve your ML solutions as they grow in size.  
+
+The solution: MLOps. MLOps is an umbrella term used to describe best practices 
and guiding principles that aim to make the development and maintenance of 
machine learning systems seamless and efficient. Simply put, MLOps is most 
often about automating machine learning workflows throughout the model and data 
lifecycle. Popular frameworks to create these workflow DAGs are [Kubeflow 
Pipelines](https://www.kubeflow.org/docs/components/pipelines/introduction/), 
[Apache 
Airflow](https://airflow.apache.org/docs/apache-airflow/stable/index.html) and 
[TFX](https://www.tensorflow.org/tfx/guide).  
+
+So what does all of this have to do with Beam? Well, since we established that 
Beam is a great tool for a range of ML tasks, a Beam pipeline can either be 
used as a standalone data processing job or be part of a larger sequence of 
steps in such a workflow. In the latter case, the Beam DAG is just one node in 
the overarching DAG composed by the workflow orchestrator. This results in a 
DAG in a DAG, as illustrated by the example below.  
+
+![A Beam pipeline as part of a larger orchestrated 
workflow](/images/orchestrated-beam-pipeline.svg)
+
+It is important to understand the key difference between the Beam DAG and the 
orchestrating DAG. The Beam DAG processes data and passes that data between the 
nodes of its DAG. The focus of Beam is on parallelization and enabling both 
batch and streaming jobs. In contrast, the orchestration DAG schedules and 
monitors the steps in the workflow. What is passed between its nodes is not the 
data itself but execution parameters, metadata and artifacts. An example of 
such an artifact could be a trained model or a dataset. Such artifacts are 
often passed by a reference URI and not by value.  
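
To make the contrast concrete, here is a minimal sketch of an orchestration 
DAG in plain Python, using the standard library's `graphlib` (Python 3.9+). It 
is not how KFP, Airflow or TFX are implemented. The step functions, the 
`run_workflow` helper and the `gs://example-bucket/...` URIs are all 
hypothetical. The point is only that the orchestrator hands each step the URIs 
of upstream artifacts, never the data itself.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set

# Each hypothetical step receives the URIs of its upstream artifacts and
# returns the URI of the artifact it produced. No dataset or model is passed
# by value between steps.
StepFn = Callable[[Dict[str, str]], str]


def ingest(inputs: Dict[str, str]) -> str:
    return "gs://example-bucket/raw/captions.json"


def preprocess(inputs: Dict[str, str]) -> str:
    # In a real workflow this node would launch a Beam pipeline that reads
    # from inputs["ingest"] and writes its output to the returned location.
    return inputs["ingest"].replace("/raw/", "/preprocessed/")


def train(inputs: Dict[str, str]) -> str:
    return "gs://example-bucket/models/model-001"


STEPS: Dict[str, StepFn] = {
    "ingest": ingest,
    "preprocess": preprocess,
    "train": train,
}

# Maps each step to the set of steps it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "ingest": set(),
    "preprocess": {"ingest"},
    "train": {"preprocess"},
}


def run_workflow(
    steps: Dict[str, StepFn], dependencies: Dict[str, Set[str]]
) -> Dict[str, str]:
    """Runs the steps in dependency order and returns their artifact URIs."""
    artifacts: Dict[str, str] = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: artifacts[dep] for dep in dependencies[name]}
        artifacts[name] = steps[name](upstream)
    return artifacts


artifacts = run_workflow(STEPS, DEPENDENCIES)
```

The `preprocess` node is where the DAG in a DAG appears: from the 
orchestrator's point of view it is a single step that consumes and produces a 
URI, while internally it would run an entire Beam DAG.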
+
+Note: TFX creates a workflow DAG, which needs an orchestrator of its own to be 
executed. [Natively supported orchestrators for 
TFX](https://www.tensorflow.org/tfx/guide/custom_orchestrator) are Airflow, 
Kubeflow Pipelines and, here’s the kicker, Beam itself! As mentioned by the 
[TFX docs](https://www.tensorflow.org/tfx/guide/beam_orchestrator):  
+> "Several TFX components rely on Beam for distributed data processing. In 
addition, TFX can use Apache Beam to orchestrate and execute the pipeline DAG. 
Beam orchestrator uses a different BeamRunner than the one which is used for 
component data processing."  
+
+Caveat: The Beam orchestrator is not meant to be used as a TFX orchestrator in 
production environments. It simply lets you debug the TFX pipeline locally on 
Beam’s DirectRunner without the extra setup that Airflow or Kubeflow require.
+
+## Preprocessing example
+
+Let’s get practical and take a look at two such orchestrated ML workflows, one 
with Kubeflow Pipelines (KFP) and one with TensorFlow Extended (TFX). These two 
frameworks achieve the same goal of creating workflows, but have their own 
distinct advantages and disadvantages: KFP requires you to create your workflow 
components from scratch and to explicitly indicate which artifacts should be 
passed between components and in what way. In contrast, TFX offers a number of 
prebuilt components and takes care of the artifact passing more implicitly. 
Clearly, there is a trade-off between flexibility and programming overhead when 
choosing between the two frameworks. We will start by looking at an example 
with KFP and then transition to TFX to show how TFX takes care of a lot of 
functionality that we had to define by hand in the KFP example.  
+
+To keep things simple, the workflows are limited to three components: data 
ingestion, data preprocessing and model training. Depending on the scenario, a 
range of extra components could be added, such as model evaluation and model 
deployment. We will focus our attention on the preprocessing component, since 
it showcases how to use Apache Beam in an ML workflow for efficient and 
parallel processing of your ML data.  
+
+The dataset we will use consists of image-caption pairs, i.e. images paired 
with a textual caption describing the content of the image. These pairs are 
taken from the captions subset of the [MSCOCO 2014 
dataset](https://cocodataset.org/#home). This multi-modal data (image + text) 
gives us the opportunity to experiment with preprocessing operations for both 
modalities.
+
+### Kubeflow Pipelines (KFP)
+
+In order to execute our ML workflow with KFP we must perform three steps:
+
+1. Create the KFP components by specifying the interface to the components and 
by writing and containerizing the implementation of the component logic.
+2. Create the KFP pipeline by connecting the created components, specifying 
how inputs and outputs should be passed between components, and compiling the 
pipeline definition to a full pipeline definition.
+3. Execute the KFP pipeline by submitting it to a KFP client endpoint.
+
+The full example code can be found 
[here](sdks/python/apache_beam/examples/ml-orchestration/kfp/)

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

