[GitHub] [airflow] TobKed commented on a change in pull request #13461: Add How To Guide for Dataflow

GitBox Mon, 11 Jan 2021 04:03:57 -0800


TobKed commented on a change in pull request #13461:
URL: https://github.com/apache/airflow/pull/13461#discussion_r554999776




##########
File path: docs/apache-airflow-providers-google/operators/cloud/dataflow.rst
##########
@@ -0,0 +1,274 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+Google Cloud Dataflow Operators
+===============================
+
+`Dataflow <https://cloud.google.com/dataflow/>`__ is a managed service for
+executing a wide variety of data processing patterns. These pipelines are created
+using the Apache Beam programming model which allows for both batch and streaming.
+
+.. contents::
+  :depth: 1
+  :local:
+
+Prerequisite Tasks
+^^^^^^^^^^^^^^^^^^
+
+.. include::/operators/_partials/prerequisite_tasks.rst
+
+Ways to run a data pipeline
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+There are multiple ways to execute a Dataflow pipeline from Airflow. To execute the pipeline
+code from a source file (Java or Python), it is best to use the language-specific create operators.
+If a process already exists to stage the pipeline code in an abstracted manner, a templated job is the better choice, as
+it allows development of the application with minimal intrusion to the DAG containing operators for it.
+It is also possible to run jobs defined in SQL.
+
+Starting a new job
+^^^^^^^^^^^^^^^^^^
+
+To create a new pipeline using the source file (JAR in Java or Python file) use
+the create job operators:
+:class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreateJavaJobOperator`
+or
+:class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreatePythonJobOperator`.
+The source file can be located on GCS or on the local filesystem.
+
+Please see the notes below on Java and Python specific SDKs as they each have their own set
+of execution options when running pipelines.
+
+Language specific pipelines
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Depending on which language (SDK) is used for the Dataflow operators, there are specific options to be aware of.
+
+.. _howto/operator:DataflowCreateJavaJobOperator:
+
+Java SDK pipelines
+""""""""""""""""""
+
+The ``jar`` argument must be specified for
+:class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreateJavaJobOperator`
+as it contains the pipeline to be executed on Dataflow. The JAR can be stored on GCS, from which Airflow
+can download it, or on the local filesystem (provide the absolute path to it).
+
+Here is an example of creating and running a pipeline in Java with jar stored on GCS:
+
+.. exampleinclude:: /../../airflow/providers/google/cloud/example_dags/example_dataflow.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_start_java_job_jar_on_gcs]
+    :end-before: [END howto_operator_start_java_job_jar_on_gcs]
+
+
+Here is an example of creating and running a pipeline in Java with jar stored on the local filesystem:
+
+.. exampleinclude:: /../../airflow/providers/google/cloud/example_dags/example_dataflow.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_start_java_job_local_jar]
+    :end-before: [END howto_operator_start_java_job_local_jar]
+
+.. _howto/operator:DataflowCreatePythonJobOperator:
+
+Python SDK pipelines
+""""""""""""""""""""
+
+The ``py_file`` argument must be specified for
+:class:`~airflow.providers.google.cloud.operators.dataflow.DataflowCreatePythonJobOperator`
+as it contains the pipeline to be executed on Dataflow. The Python file can be stored on GCS, from which Airflow
+can download it, or on the local filesystem (provide the absolute path to it).
+
+The ``py_interpreter`` argument specifies the Python version to be used when executing the pipeline; the default
+is ``python3``. If your Airflow instance is running on Python 2, specify ``python2`` and ensure your ``py_file`` is
+in Python 2. For best results, use Python 3.
+
+If the ``py_requirements`` argument is specified, a temporary Python virtual environment with the specified requirements will be created
+and the pipeline will run within it.
+
+The ``py_system_site_packages`` argument specifies whether or not all the Python packages from your Airflow instance
+will be accessible within the virtual environment (if the ``py_requirements`` argument is specified).
+Avoid enabling it unless the Dataflow job requires it.
+
+.. exampleinclude:: /../../airflow/providers/google/cloud/example_dags/example_dataflow.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_start_python_job]
+    :end-before: [END howto_operator_start_python_job]
+
+
+Execution options for pipelines
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Dataflow offers multiple ways of executing pipelines. A pipeline can run in one of the following modes:
+batch asynchronous (fire and forget), batch blocking (wait until completion), or streaming (run indefinitely).

Review comment:
   It is based on the Dataflow documentation:
   https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#configuring-pipelineoptions-for-execution-on-the-cloud-dataflow-service
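
To make the three modes concrete, here is a minimal sketch in plain Python. It is an illustration only and is not taken from the PR or from the Dataflow documentation: the helper `execution_settings` and the mode names are hypothetical, and the mapping onto a `wait_until_finished` operator argument and a Beam `streaming` pipeline option is an assumption about how these modes would typically be expressed.

```python
from typing import Any, Dict

# Hypothetical mode names for the three execution modes named in the guide.
MODES = ("batch_async", "batch_blocking", "streaming")


def execution_settings(mode: str) -> Dict[str, Any]:
    """Return illustrative settings for one execution mode (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    return {
        # Assumption: corresponds to an operator-level ``wait_until_finished`` flag.
        # Only the blocking batch mode waits for the job to complete.
        "wait_until_finished": mode == "batch_blocking",
        # Assumption: corresponds to the Beam ``streaming`` pipeline option.
        "pipeline_options": {"streaming": mode == "streaming"},
    }


for mode in MODES:
    print(mode, execution_settings(mode))
```

The point of the sketch is only that blocking versus fire-and-forget is a question of whether the caller waits, while streaming is a property of the pipeline itself.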




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
