Re: [PR] Yaml API: Day Zero tutorial notebook [beam]

via GitHub Thu, 01 Feb 2024 15:18:08 -0800


Polber commented on code in PR #27284:
URL: https://github.com/apache/beam/pull/27284#discussion_r1475289458



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,671 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "view-in-github"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "cellView": "form"
+   },
+   "outputs": [],
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "lNKIMlEDZ_Vw"
+   },
+   "source": [
+    "# Try Apache Beam - YAML\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it still has a high barrier for getting started and 
authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be a challenge.\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
+    "\n",
+    "It should be noted that everything here is still under development, but 
any features already included are considered stable. Feedback is welcome at 
[email protected].\n",
+    "\n",
+    "In this notebook, you set up your development environment and write a 
simple pipeline using YAML. Then you run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capability 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, click the **Run cell** button at the top left of the 
cell, or select it and press **`Shift+Enter`**. Try modifying a code cell and 
re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "Fz6KSQ13_3Rr"
+   },
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment. The following code installs 
`apache-beam` and creates directories for your data, pipelines and results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/";,
+     "height": 170
+    },
+    "colab_type": "code",
+    "id": "GOOk81Jj_yUy",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c"
+   },
+   "outputs": [],
+   "source": [
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "! pip install --quiet apache-beam\n",
+    "\n",
+    "# Create a directory for storing the data, pipelines and results\n",
+    "! mkdir -p data\n",
+    "! mkdir -p pipelines\n",
+    "! mkdir -p results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We'll also create an artificial dataset that represents a simple 
database. The csv file contains information about different people. Each line 
represents a single person and their details are separated by commas. The file 
contains 5 columns: id, firstname, age, country and a profession."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "csv_data= '''\n",
+    "id,firstname,age,country,profession\n",
+    "1,Reeba,58,Belgium,unemployed\n",
+    "2,Maud,45,Spain,firefighter\n",
+    "3,Meg,11,France,unemployed\n",
+    "4,Rani,53,Spain,doctor\n",
+    "5,Natka,26,France,doctor\n",
+    "6,Aurore,32,Italy,police officer\n",
+    "7,Elvira,39,Italy,doctor\n",
+    "8,Asia,10,Belgium,doctor\n",
+    "9,Lesly,35,Spain,firefighter\n",
+    "10,Orelia,31,Germany,police officer\n",
+    "11,Theodora,16,Italy,unemployed\n",
+    "12,Viviene,44,Germany,police officer\n",
+    "13,Teriann,50,Belgium,police officer\n",
+    "14,Carol-Jean,23,Germany,unemployed\n",
+    "15,Drucie,15,Spain,police officer\n",
+    "16,Elie,10,Italy,doctor\n",
+    "17,Shaylyn,34,Belgium,worker\n",
+    "18,Fayre,33,Spain,police officer\n",
+    "19,Sabina,52,Germany,police officer\n",
+    "20,Aryn,20,Belgium,police officer\n",
+    "21,Darlleen,49,Spain,worker\n",
+    "22,Jere,18,Italy,worker\n",
+    "23,Candi,60,Germany,police officer\n",
+    "24,Sindee,40,Germany,firefighter\n",
+    "25,Selma,20,Spain,worker\n",
+    "26,Vonny,35,Germany,doctor\n",
+    "27,Kate,53,Spain,worker\n",
+    "28,Annabela,48,Belgium,worker\n",
+    "29,Jenilee,55,Germany,police officer\n",
+    "30,Gusella,44,France,police officer\n",
+    "31,Fawne,35,Spain,worker\n",
+    "32,Karolina,39,Spain,police officer\n",
+    "33,Sadie,58,Germany,firefighter\n",
+    "34,Clo,10,Italy,police officer\n",
+    "35,Beth,46,Spain,firefighter\n",
+    "36,Adore,18,Italy,firefighter\n",
+    "37,Tarra,29,Spain,doctor\n",
+    "38,Jessamyn,36,France,police officer\n",
+    "39,Deedee,24,Germany,unemployed\n",
+    "40,Patricia,45,Italy,doctor\n",
+    "41,Wileen,39,Spain,police officer\n",
+    "42,Paola,55,Italy,worker\n",
+    "43,Gwyneth,37,Italy,worker\n",
+    "44,Stacey,36,Spain,worker\n",
+    "45,Camile,60,Germany,unemployed\n",
+    "46,Sheree,10,Spain,unemployed\n",
+    "47,Albertina,53,France,police officer\n",
+    "48,Jinny,30,Spain,firefighter\n",
+    "49,Kayla,57,Italy,firefighter\n",
+    "50,Jaime,55,France,doctor\n",
+    "'''\n",
+    "\n",
+    "save_to_file(csv_data, 'data/people.csv')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's validate if the file was created correctly. You should see the 
first few lines from the generated file. Validate if the beginning of the file 
matches with the declared content above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! head data/people.csv"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Your first YAML pipelines\n",
+    "\n",
+    "In this section we'll present you the basic structure of a YAML pipeline 
and present you some available transforms.\n",
+    "Below is a simple pipeline that reads data from the csv file we've just 
created and logs the elements for debugging purposes.\n",
+    "\n",
+    "The `LogForTesting` transform lets us log the data when developing a 
pipeline. Remember, it is not advised to use this transform in production.\n",
+    "\n",
+    "Let's define the pipeline and save it to a file:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "    - type: ReadFromCsv\n",
+    "      config:\n",
+    "        path: data/people.csv\n",
+    "    - type: LogForTesting\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipelines/pipeline-01.yaml')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "We can verify the contents of this file by running:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "! cat pipelines/pipeline-01.yaml"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Now, we can execute the yaml pipeline by passing this file as an argument 
to the following command:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "! python -m apache_beam.yaml.main 
--pipeline_spec_file=pipelines/pipeline-01.yaml"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here we use Python and `apache_beam` package to execute the pipeline, but 
we envision various services (such as Dataflow) to accept yaml pipelines 
directly obviating the need for that in the future.\n",
+    "\n",
+    "If you scroll through the output logs, you'll find entries such as:\n",
+    "```\n",
+    "INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=1, 
firstname='Reeba', age=58, country='Belgium', profession='unemployed')\n",
+    "INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=2, 
firstname='Maud', age=45, country='Spain', profession='firefighter')\n",
+    "INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=3, 
firstname='Meg', age=11, country='France', profession='unemployed')\n",
+    "INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=4, 
firstname='Rani', age=53, country='Spain', profession='doctor')\n",
+    "```\n",
+    "This is a representation of records from the input dataset."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's add a transform - `Filter`. To use this transform you need to  
specify the 'keep' condition and a language your condition is written in. Below 
you'll find an example with a condition written in Python.\n",
+    "This pipeline will filter out records containing people that are younger 
than 18 years old. The only records left to the next transform will be records 
representing adults. Verify this by scrolling to the bottom of the output logs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "    - type: ReadFromCsv\n",
+    "      config:\n",
+    "        path: data/people.csv\n",
+    "    - type: Filter\n",
+    "      config:\n",
+    "        language: python\n",
+    "        keep: \"age >= 18\"\n",
+    "    - type: LogForTesting\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipelines/pipeline-filter-01.yaml')\n",
+    "! python -m apache_beam.yaml.main 
--pipeline_spec_file=pipelines/pipeline-filter-01.yaml"

Review Comment:
   In which case I would move the run command to its own cell.



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,671 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "view-in-github"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "cellView": "form"
+   },
+   "outputs": [],
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "lNKIMlEDZ_Vw"
+   },
+   "source": [
+    "# Try Apache Beam - YAML\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it still has a high barrier for getting started and 
authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be a challenge.\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
+    "\n",
+    "It should be noted that everything here is still under development, but 
any features already included are considered stable. Feedback is welcome at 
[email protected].\n",
+    "\n",
+    "In this notebook, you set up your development environment and write a 
simple pipeline using YAML. Then you run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capability 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, click the **Run cell** button at the top left of the 
cell, or select it and press **`Shift+Enter`**. Try modifying a code cell and 
re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "Fz6KSQ13_3Rr"
+   },
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment. The following code installs 
`apache-beam` and creates directories for your data, pipelines and results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/";,
+     "height": 170
+    },
+    "colab_type": "code",
+    "id": "GOOk81Jj_yUy",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c"
+   },
+   "outputs": [],
+   "source": [
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "! pip install --quiet apache-beam\n",
+    "\n",
+    "# Create a directory for storing the data, pipelines and results\n",
+    "! mkdir -p data\n",
+    "! mkdir -p pipelines\n",
+    "! mkdir -p results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We'll also create an artificial dataset that represents a simple 
database. The csv file contains information about different people. Each line 
represents a single person and their details are separated by commas. The file 
contains 5 columns: id, firstname, age, country and a profession."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "csv_data= '''\n",
+    "id,firstname,age,country,profession\n",
+    "1,Reeba,58,Belgium,unemployed\n",
+    "2,Maud,45,Spain,firefighter\n",
+    "3,Meg,11,France,unemployed\n",
+    "4,Rani,53,Spain,doctor\n",
+    "5,Natka,26,France,doctor\n",
+    "6,Aurore,32,Italy,police officer\n",
+    "7,Elvira,39,Italy,doctor\n",
+    "8,Asia,10,Belgium,doctor\n",
+    "9,Lesly,35,Spain,firefighter\n",
+    "10,Orelia,31,Germany,police officer\n",
+    "11,Theodora,16,Italy,unemployed\n",
+    "12,Viviene,44,Germany,police officer\n",
+    "13,Teriann,50,Belgium,police officer\n",
+    "14,Carol-Jean,23,Germany,unemployed\n",
+    "15,Drucie,15,Spain,police officer\n",
+    "16,Elie,10,Italy,doctor\n",
+    "17,Shaylyn,34,Belgium,worker\n",
+    "18,Fayre,33,Spain,police officer\n",
+    "19,Sabina,52,Germany,police officer\n",
+    "20,Aryn,20,Belgium,police officer\n",
+    "21,Darlleen,49,Spain,worker\n",
+    "22,Jere,18,Italy,worker\n",
+    "23,Candi,60,Germany,police officer\n",
+    "24,Sindee,40,Germany,firefighter\n",
+    "25,Selma,20,Spain,worker\n",
+    "26,Vonny,35,Germany,doctor\n",
+    "27,Kate,53,Spain,worker\n",
+    "28,Annabela,48,Belgium,worker\n",
+    "29,Jenilee,55,Germany,police officer\n",
+    "30,Gusella,44,France,police officer\n",
+    "31,Fawne,35,Spain,worker\n",
+    "32,Karolina,39,Spain,police officer\n",
+    "33,Sadie,58,Germany,firefighter\n",
+    "34,Clo,10,Italy,police officer\n",
+    "35,Beth,46,Spain,firefighter\n",
+    "36,Adore,18,Italy,firefighter\n",
+    "37,Tarra,29,Spain,doctor\n",
+    "38,Jessamyn,36,France,police officer\n",
+    "39,Deedee,24,Germany,unemployed\n",
+    "40,Patricia,45,Italy,doctor\n",
+    "41,Wileen,39,Spain,police officer\n",
+    "42,Paola,55,Italy,worker\n",
+    "43,Gwyneth,37,Italy,worker\n",
+    "44,Stacey,36,Spain,worker\n",
+    "45,Camile,60,Germany,unemployed\n",
+    "46,Sheree,10,Spain,unemployed\n",
+    "47,Albertina,53,France,police officer\n",
+    "48,Jinny,30,Spain,firefighter\n",
+    "49,Kayla,57,Italy,firefighter\n",
+    "50,Jaime,55,France,doctor\n",
+    "'''\n",
+    "\n",
+    "save_to_file(csv_data, 'data/people.csv')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's validate if the file was created correctly. You should see the 
first few lines from the generated file. Validate if the beginning of the file 
matches with the declared content above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! head data/people.csv"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Your first YAML pipelines\n",
+    "\n",
+    "In this section we'll present you the basic structure of a YAML pipeline 
and present you some available transforms.\n",
+    "Below is a simple pipeline that reads data from the csv file we've just 
created and logs the elements for debugging purposes.\n",
+    "\n",
+    "The `LogForTesting` transform lets us log the data when developing a 
pipeline. Remember, it is not advised to use this transform in production.\n",
+    "\n",
+    "Let's define the pipeline and save it to a file:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "    - type: ReadFromCsv\n",
+    "      config:\n",
+    "        path: data/people.csv\n",
+    "    - type: LogForTesting\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipelines/pipeline-01.yaml')"

Review Comment:
   Colab has a built-in writefile method that I think is less messy and allows 
makes specifying string literals in YAML easier
   ```suggestion
       "%%writefile pipelines/pipeline-01.yaml\n",
       "\n",
       "pipeline:\n",
       "  type: chain\n",
       "  transforms:\n",
       "    - type: ReadFromCsv\n",
       "      config:\n",
       "        path: data/people.csv\n",
       "    - type: LogForTesting\n"
   ```



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,671 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "view-in-github"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "cellView": "form"
+   },
+   "outputs": [],
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "lNKIMlEDZ_Vw"
+   },
+   "source": [
+    "# Try Apache Beam - YAML\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it still has a high barrier for getting started and 
authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be a challenge.\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
+    "\n",
+    "It should be noted that everything here is still under development, but 
any features already included are considered stable. Feedback is welcome at 
[email protected].\n",
+    "\n",
+    "In this notebook, you set up your development environment and write a 
simple pipeline using YAML. Then you run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capability 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, click the **Run cell** button at the top left of the 
cell, or select it and press **`Shift+Enter`**. Try modifying a code cell and 
re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "Fz6KSQ13_3Rr"
+   },
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment. The following code installs 
`apache-beam` and creates directories for your data, pipelines and results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/";,
+     "height": 170
+    },
+    "colab_type": "code",
+    "id": "GOOk81Jj_yUy",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c"
+   },
+   "outputs": [],
+   "source": [
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",

Review Comment:
   I believe you can get rid of this with my other comments



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,671 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "view-in-github"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "cellView": "form"
+   },
+   "outputs": [],
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "lNKIMlEDZ_Vw"
+   },
+   "source": [
+    "# Try Apache Beam - YAML\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it still has a high barrier for getting started and 
authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be a challenge.\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
+    "\n",
+    "It should be noted that everything here is still under development, but 
any features already included are considered stable. Feedback is welcome at 
[email protected].\n",
+    "\n",
+    "In this notebook, you set up your development environment and write a 
simple pipeline using YAML. Then you run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capability 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, click the **Run cell** button at the top left of the 
cell, or select it and press **`Shift+Enter`**. Try modifying a code cell and 
re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "Fz6KSQ13_3Rr"
+   },
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment. The following code installs 
`apache-beam` and creates directories for your data, pipelines and results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/";,
+     "height": 170
+    },
+    "colab_type": "code",
+    "id": "GOOk81Jj_yUy",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c"
+   },
+   "outputs": [],
+   "source": [
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "! pip install --quiet apache-beam\n",
+    "\n",
+    "# Create a directory for storing the data, pipelines and results\n",
+    "! mkdir -p data\n",
+    "! mkdir -p pipelines\n",
+    "! mkdir -p results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We'll also create an artificial dataset that represents a simple 
database. The csv file contains information about different people. Each line 
represents a single person and their details are separated by commas. The file 
contains 5 columns: id, firstname, age, country and a profession."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "csv_data= '''\n",
+    "id,firstname,age,country,profession\n",
+    "1,Reeba,58,Belgium,unemployed\n",
+    "2,Maud,45,Spain,firefighter\n",
+    "3,Meg,11,France,unemployed\n",
+    "4,Rani,53,Spain,doctor\n",
+    "5,Natka,26,France,doctor\n",
+    "6,Aurore,32,Italy,police officer\n",
+    "7,Elvira,39,Italy,doctor\n",
+    "8,Asia,10,Belgium,doctor\n",
+    "9,Lesly,35,Spain,firefighter\n",
+    "10,Orelia,31,Germany,police officer\n",
+    "11,Theodora,16,Italy,unemployed\n",
+    "12,Viviene,44,Germany,police officer\n",
+    "13,Teriann,50,Belgium,police officer\n",
+    "14,Carol-Jean,23,Germany,unemployed\n",
+    "15,Drucie,15,Spain,police officer\n",
+    "16,Elie,10,Italy,doctor\n",
+    "17,Shaylyn,34,Belgium,worker\n",
+    "18,Fayre,33,Spain,police officer\n",
+    "19,Sabina,52,Germany,police officer\n",
+    "20,Aryn,20,Belgium,police officer\n",
+    "21,Darlleen,49,Spain,worker\n",
+    "22,Jere,18,Italy,worker\n",
+    "23,Candi,60,Germany,police officer\n",
+    "24,Sindee,40,Germany,firefighter\n",
+    "25,Selma,20,Spain,worker\n",
+    "26,Vonny,35,Germany,doctor\n",
+    "27,Kate,53,Spain,worker\n",
+    "28,Annabela,48,Belgium,worker\n",
+    "29,Jenilee,55,Germany,police officer\n",
+    "30,Gusella,44,France,police officer\n",
+    "31,Fawne,35,Spain,worker\n",
+    "32,Karolina,39,Spain,police officer\n",
+    "33,Sadie,58,Germany,firefighter\n",
+    "34,Clo,10,Italy,police officer\n",
+    "35,Beth,46,Spain,firefighter\n",
+    "36,Adore,18,Italy,firefighter\n",
+    "37,Tarra,29,Spain,doctor\n",
+    "38,Jessamyn,36,France,police officer\n",
+    "39,Deedee,24,Germany,unemployed\n",
+    "40,Patricia,45,Italy,doctor\n",
+    "41,Wileen,39,Spain,police officer\n",
+    "42,Paola,55,Italy,worker\n",
+    "43,Gwyneth,37,Italy,worker\n",
+    "44,Stacey,36,Spain,worker\n",
+    "45,Camile,60,Germany,unemployed\n",
+    "46,Sheree,10,Spain,unemployed\n",
+    "47,Albertina,53,France,police officer\n",
+    "48,Jinny,30,Spain,firefighter\n",
+    "49,Kayla,57,Italy,firefighter\n",
+    "50,Jaime,55,France,doctor\n",
+    "'''\n",
+    "\n",
+    "save_to_file(csv_data, 'data/people.csv')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's validate if the file was created correctly. You should see the 
first few lines from the generated file. Validate if the beginning of the file 
matches with the declared content above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! head data/people.csv"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Your first YAML pipelines\n",
+    "\n",
+    "In this section we'll present you the basic structure of a YAML pipeline 
and present you some available transforms.\n",
+    "Below is a simple pipeline that reads data from the csv file we've just 
created and logs the elements for debugging purposes.\n",
+    "\n",
+    "The `LogForTesting` transform lets us log the data when developing a 
pipeline. Remember, it is not advised to use this transform in production.\n",
+    "\n",
+    "Let's define the pipeline and save it to a file:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "    - type: ReadFromCsv\n",
+    "      config:\n",
+    "        path: data/people.csv\n",
+    "    - type: LogForTesting\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipelines/pipeline-01.yaml')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "We can verify the contents of this file by running:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "! cat pipelines/pipeline-01.yaml"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Now, we can execute the yaml pipeline by passing this file as an argument 
to the following command:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "! python -m apache_beam.yaml.main 
--pipeline_spec_file=pipelines/pipeline-01.yaml"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here we use Python and `apache_beam` package to execute the pipeline, but 
we envision various services (such as Dataflow) to accept yaml pipelines 
directly obviating the need for that in the future.\n",
+    "\n",
+    "If you scroll through the output logs, you'll find entries such as:\n",
+    "```\n",
+    "INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=1, 
firstname='Reeba', age=58, country='Belgium', profession='unemployed')\n",
+    "INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=2, 
firstname='Maud', age=45, country='Spain', profession='firefighter')\n",
+    "INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=3, 
firstname='Meg', age=11, country='France', profession='unemployed')\n",
+    "INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=4, 
firstname='Rani', age=53, country='Spain', profession='doctor')\n",
+    "```\n",
+    "This is a representation of records from the input dataset."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's add a transform - `Filter`. To use this transform you need to  
specify the 'keep' condition and a language your condition is written in. Below 
you'll find an example with a condition written in Python.\n",
+    "This pipeline will filter out records containing people that are younger 
than 18 years old. The only records left to the next transform will be records 
representing adults. Verify this by scrolling to the bottom of the output logs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "    - type: ReadFromCsv\n",
+    "      config:\n",
+    "        path: data/people.csv\n",
+    "    - type: Filter\n",
+    "      config:\n",
+    "        language: python\n",
+    "        keep: \"age >= 18\"\n",
+    "    - type: LogForTesting\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipelines/pipeline-filter-01.yaml')\n",
+    "! python -m apache_beam.yaml.main 
--pipeline_spec_file=pipelines/pipeline-filter-01.yaml"

Review Comment:
   Same here.
   ```suggestion
       "%%writefile pipelines/pipeline-filter-01.yaml\n",
       "pipeline:\n",
       "  type: chain\n",
       "  transforms:\n",
       "    - type: ReadFromCsv\n",
       "      config:\n",
       "        path: data/people.csv\n",
       "    - type: Filter\n",
       "      config:\n",
       "        language: python\n",
       "        keep: \"age >= 18\"\n",
       "    - type: LogForTesting\n"
   ```



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,671 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "view-in-github"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "cellView": "form"
+   },
+   "outputs": [],
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "lNKIMlEDZ_Vw"
+   },
+   "source": [
+    "# Try Apache Beam - YAML\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it still has a high barrier for getting started and 
authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be a challenge.\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
+    "\n",
+    "It should be noted that everything here is still under development, but 
any features already included are considered stable. Feedback is welcome at 
[email protected].\n",
+    "\n",
+    "In this notebook, you set up your development environment and write a 
simple pipeline using YAML. Then you run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capability 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, click the **Run cell** button at the top left of the 
cell, or select it and press **`Shift+Enter`**. Try modifying a code cell and 
re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "Fz6KSQ13_3Rr"
+   },
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment. The following code installs 
`apache-beam` and creates directories for your data, pipelines and results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/";,
+     "height": 170
+    },
+    "colab_type": "code",
+    "id": "GOOk81Jj_yUy",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c"
+   },
+   "outputs": [],
+   "source": [
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "! pip install --quiet apache-beam\n",
+    "\n",
+    "# Create a directory for storing the data, pipelines and results\n",
+    "! mkdir -p data\n",
+    "! mkdir -p pipelines\n",
+    "! mkdir -p results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We'll also create an artificial dataset that represents a simple 
database. The csv file contains information about different people. Each line 
represents a single person and their details are separated by commas. The file 
contains 5 columns: id, firstname, age, country and a profession."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "csv_data= '''\n",
+    "id,firstname,age,country,profession\n",
+    "1,Reeba,58,Belgium,unemployed\n",
+    "2,Maud,45,Spain,firefighter\n",
+    "3,Meg,11,France,unemployed\n",
+    "4,Rani,53,Spain,doctor\n",
+    "5,Natka,26,France,doctor\n",
+    "6,Aurore,32,Italy,police officer\n",
+    "7,Elvira,39,Italy,doctor\n",
+    "8,Asia,10,Belgium,doctor\n",
+    "9,Lesly,35,Spain,firefighter\n",
+    "10,Orelia,31,Germany,police officer\n",
+    "11,Theodora,16,Italy,unemployed\n",
+    "12,Viviene,44,Germany,police officer\n",
+    "13,Teriann,50,Belgium,police officer\n",
+    "14,Carol-Jean,23,Germany,unemployed\n",
+    "15,Drucie,15,Spain,police officer\n",
+    "16,Elie,10,Italy,doctor\n",
+    "17,Shaylyn,34,Belgium,worker\n",
+    "18,Fayre,33,Spain,police officer\n",
+    "19,Sabina,52,Germany,police officer\n",
+    "20,Aryn,20,Belgium,police officer\n",
+    "21,Darlleen,49,Spain,worker\n",
+    "22,Jere,18,Italy,worker\n",
+    "23,Candi,60,Germany,police officer\n",
+    "24,Sindee,40,Germany,firefighter\n",
+    "25,Selma,20,Spain,worker\n",
+    "26,Vonny,35,Germany,doctor\n",
+    "27,Kate,53,Spain,worker\n",
+    "28,Annabela,48,Belgium,worker\n",
+    "29,Jenilee,55,Germany,police officer\n",
+    "30,Gusella,44,France,police officer\n",
+    "31,Fawne,35,Spain,worker\n",
+    "32,Karolina,39,Spain,police officer\n",
+    "33,Sadie,58,Germany,firefighter\n",
+    "34,Clo,10,Italy,police officer\n",
+    "35,Beth,46,Spain,firefighter\n",
+    "36,Adore,18,Italy,firefighter\n",
+    "37,Tarra,29,Spain,doctor\n",
+    "38,Jessamyn,36,France,police officer\n",
+    "39,Deedee,24,Germany,unemployed\n",
+    "40,Patricia,45,Italy,doctor\n",
+    "41,Wileen,39,Spain,police officer\n",
+    "42,Paola,55,Italy,worker\n",
+    "43,Gwyneth,37,Italy,worker\n",
+    "44,Stacey,36,Spain,worker\n",
+    "45,Camile,60,Germany,unemployed\n",
+    "46,Sheree,10,Spain,unemployed\n",
+    "47,Albertina,53,France,police officer\n",
+    "48,Jinny,30,Spain,firefighter\n",
+    "49,Kayla,57,Italy,firefighter\n",
+    "50,Jaime,55,France,doctor\n",
+    "'''\n",
+    "\n",
+    "save_to_file(csv_data, 'data/people.csv')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's validate if the file was created correctly. You should see the 
first few lines from the generated file. Validate if the beginning of the file 
matches with the declared content above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! head data/people.csv"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Your first YAML pipelines\n",
+    "\n",
+    "In this section we'll present you the basic structure of a YAML pipeline 
and present you some available transforms.\n",
+    "Below is a simple pipeline that reads data from the csv file we've just 
created and logs the elements for debugging purposes.\n",
+    "\n",
+    "The `LogForTesting` transform lets us log the data when developing a 
pipeline. Remember, it is not advised to use this transform in production.\n",
+    "\n",
+    "Let's define the pipeline and save it to a file:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "    - type: ReadFromCsv\n",
+    "      config:\n",
+    "        path: data/people.csv\n",
+    "    - type: LogForTesting\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipelines/pipeline-01.yaml')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "We can verify the contents of this file by running:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "! cat pipelines/pipeline-01.yaml"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Now, we can execute the yaml pipeline by passing this file as an argument 
to the following command:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "! python -m apache_beam.yaml.main 
--pipeline_spec_file=pipelines/pipeline-01.yaml"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here we use Python and `apache_beam` package to execute the pipeline, but 
we envision various services (such as Dataflow) to accept yaml pipelines 
directly obviating the need for that in the future.\n",
+    "\n",
+    "If you scroll through the output logs, you'll find entries such as:\n",
+    "```\n",
+    "INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=1, 
firstname='Reeba', age=58, country='Belgium', profession='unemployed')\n",
+    "INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=2, 
firstname='Maud', age=45, country='Spain', profession='firefighter')\n",
+    "INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=3, 
firstname='Meg', age=11, country='France', profession='unemployed')\n",
+    "INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=4, 
firstname='Rani', age=53, country='Spain', profession='doctor')\n",
+    "```\n",
+    "This is a representation of records from the input dataset."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's add a transform - `Filter`. To use this transform you need to  
specify the 'keep' condition and a language your condition is written in. Below 
you'll find an example with a condition written in Python.\n",
+    "This pipeline will filter out records containing people that are younger 
than 18 years old. The only records left to the next transform will be records 
representing adults. Verify this by scrolling to the bottom of the output logs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "    - type: ReadFromCsv\n",
+    "      config:\n",
+    "        path: data/people.csv\n",
+    "    - type: Filter\n",
+    "      config:\n",
+    "        language: python\n",
+    "        keep: \"age >= 18\"\n",
+    "    - type: LogForTesting\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipelines/pipeline-filter-01.yaml')\n",
+    "! python -m apache_beam.yaml.main 
--pipeline_spec_file=pipelines/pipeline-filter-01.yaml"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Similarly, we can create a condition in other languages, for example SQL. 
In this example we filter out people that are younger than 18 and have a 
profession other than being 'unemployed'."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "    - type: ReadFromCsv\n",
+    "      config:\n",
+    "        path: data/people.csv\n",
+    "    - type: Filter\n",
+    "      config:\n",
+    "        language: sql\n",
+    "        keep: \"age >= 18 or (age < 18 and profession = 
'unemployed')\"\n",
+    "    - type: LogForTesting\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipelines/pipeline-filter-02.yaml')\n",
+    "! python -m apache_beam.yaml.main 
--pipeline_spec_file=pipelines/pipeline-filter-02.yaml"

Review Comment:
   Same here (and subsequent Yaml pipeline blocks)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Yaml API: Day Zero tutorial notebook [beam]

Reply via email to