[GitHub] [beam] amotley commented on a diff in pull request #27284: Yaml API: Day Zero tutorial notebook

via GitHub Thu, 29 Jun 2023 13:23:40 -0700


amotley commented on code in PR #27284:
URL: https://github.com/apache/beam/pull/27284#discussion_r1247127090



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,646 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - YAML\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it still has a high barrier for getting started and 
authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate.\n",
+    "\n",
+    "Here we provide a simple YAML syntax for describing pipelines that does 
not require coding experience or learning how to use an SDK&mdash;any text 
editor will do.\n",
+    "\n",
+    "Please note: YAML API is still EXPERIMENTAL and subject to change.\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a 
simple pipeline using YAML. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capatibility 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, you can click the **Run cell** button at the top left 
of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code 
cell and re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "Fz6KSQ13_3Rr",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment, which includes installing 
`apache-beam` and downloading a text file from Cloud Storage to your local file 
system. We are using this file to test your pipeline."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "GOOk81Jj_yUy",
+    "colab_type": "code",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+    "colab": {
+     "base_uri": "https://localhost:8080/";,
+     "height": 170
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "# Run and print a shell command.\n",
+    "def run(cmd):\n",
+    "  print('>> {}'.format(cmd))\n",
+    "  !{cmd}\n",
+    "  print('')\n",
+    "\n",
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "run('pip install --quiet apache-beam')\n",
+    "\n",
+    "# Copy the input files into the local file system.\n",
+    "run('mkdir -p data')\n",
+    "run('gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt 
data/kinglear.txt')\n",
+    "run('gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv')"
+   ],
+   "execution_count": 134,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> pip install --quiet apache-beam\n",
+      "\n",
+      ">> mkdir -p data\n",
+      "\n",
+      ">> gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt 
data/kinglear.txt\n",
+      "Copying gs://dataflow-samples/shakespeare/kinglear.txt...\r\n",
+      "/ [1 files][153.6 KiB/153.6 KiB]                                        
        \r\n",
+      "Operation completed over 1 objects/153.6 KiB.                           
         \r\n",
+      "\n",
+      ">> gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv\n",
+      "Copying 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection...\r\n",
+      "/ [1 files][466.7 KiB/466.7 KiB]                                        
        \r\n",
+      "Operation completed over 1 objects/466.7 KiB.                           
         \r\n",
+      "\n"
+     ]
+    }
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Inspect the data\n",
+    "We'll be working with 2 datasets. We'll use `kinglear.txt` for the first 
example - word count, and `SMSSpamCollection.csv` for the second and third.\n",
+    "Let's first take a loot at the `kinglear.txt` dataset."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 135,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> head data/kinglear.txt\n",
+      "\tKING LEAR\r\n",
+      "\r\n",
+      "\r\n",
+      "\tDRAMATIS PERSONAE\r\n",
+      "\r\n",
+      "\r\n",
+      "LEAR\tking of Britain  (KING LEAR:)\r\n",
+      "\r\n",
+      "KING OF FRANCE:\r\n",
+      "\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "run('head data/kinglear.txt')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This is just a `txt` file - it contains lines of text.\n",
+    "Let's take a look at the other dataset."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 136,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> head data/SMSSpamCollection.csv\n",
+      "ham\tGo until jurong point, crazy.. Available only in bugis n great 
world la e buffet... Cine there got amore wat...\r\n",
+      "ham\tOk lar... Joking wif u oni...\r\n",
+      "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 
2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 
08452810075over18's\r\n",
+      "ham\tU dun say so early hor... U c already then say...\r\n",
+      "ham\tNah I don't think he goes to usf, he lives around here though\r\n",
+      "spam\tFreeMsg Hey there darling it's been 3 week's now and no word 
back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 
to rcv\r\n",
+      "ham\tEven my brother is not like to speak with me. They treat me like 
aids patent.\r\n",
+      "ham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu 
Vettam)' has been set as your callertune for all Callers. Press *9 to copy your 
friends Callertune\r\n",
+      "spam\tWINNER!! As a valued network customer you have been selected to 
receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 
12 hours only.\r\n",
+      "spam\tHad your mobile 11 months or more? U R entitled to Update to the 
latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 
08002986030\r\n",
+      "\n",
+      ">> wc -l data/SMSSpamCollection.csv\n",
+      "5574 data/SMSSpamCollection.csv\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "run('head data/SMSSpamCollection.csv')\n",
+    "run('wc -l data/SMSSpamCollection.csv')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This dataset is a `csv` file with 5,574 rows and 2 columns recording the 
following attributes separated by a tab sign:\n",
+    "1. `Column 1`: The label (either `ham` or `spam`)\n",
+    "2. `Column 2`: The SMS as raw text (type `string`)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Example 1: word count\n",
+    "In this popular introductory exercise, we will build a pipeline that 
reads lines of text from the input dataset `kinglear.txt` and counts the number 
of times each word appears in the text.\n",
+    "To start, we'll create a `yaml` file specifying our pipeline."

Review Comment:
   ```suggestion
       "To start, we'll create a `YAML` file specifying our pipeline."
   ```



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,646 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - YAML\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it still has a high barrier for getting started and 
authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate.\n",
+    "\n",
+    "Here we provide a simple YAML syntax for describing pipelines that does 
not require coding experience or learning how to use an SDK&mdash;any text 
editor will do.\n",
+    "\n",
+    "Please note: YAML API is still EXPERIMENTAL and subject to change.\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a 
simple pipeline using YAML. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capatibility 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, you can click the **Run cell** button at the top left 
of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code 
cell and re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "Fz6KSQ13_3Rr",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment, which includes installing 
`apache-beam` and downloading a text file from Cloud Storage to your local file 
system. We are using this file to test your pipeline."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "GOOk81Jj_yUy",
+    "colab_type": "code",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+    "colab": {
+     "base_uri": "https://localhost:8080/";,
+     "height": 170
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "# Run and print a shell command.\n",
+    "def run(cmd):\n",
+    "  print('>> {}'.format(cmd))\n",
+    "  !{cmd}\n",
+    "  print('')\n",
+    "\n",
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "run('pip install --quiet apache-beam')\n",
+    "\n",
+    "# Copy the input files into the local file system.\n",
+    "run('mkdir -p data')\n",
+    "run('gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt 
data/kinglear.txt')\n",
+    "run('gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv')"
+   ],
+   "execution_count": 134,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> pip install --quiet apache-beam\n",
+      "\n",
+      ">> mkdir -p data\n",
+      "\n",
+      ">> gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt 
data/kinglear.txt\n",
+      "Copying gs://dataflow-samples/shakespeare/kinglear.txt...\r\n",
+      "/ [1 files][153.6 KiB/153.6 KiB]                                        
        \r\n",
+      "Operation completed over 1 objects/153.6 KiB.                           
         \r\n",
+      "\n",
+      ">> gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv\n",
+      "Copying 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection...\r\n",
+      "/ [1 files][466.7 KiB/466.7 KiB]                                        
        \r\n",
+      "Operation completed over 1 objects/466.7 KiB.                           
         \r\n",
+      "\n"
+     ]
+    }
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Inspect the data\n",
+    "We'll be working with 2 datasets. We'll use `kinglear.txt` for the first 
example - word count, and `SMSSpamCollection.csv` for the second and third.\n",
+    "Let's first take a loot at the `kinglear.txt` dataset."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 135,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> head data/kinglear.txt\n",
+      "\tKING LEAR\r\n",
+      "\r\n",
+      "\r\n",
+      "\tDRAMATIS PERSONAE\r\n",
+      "\r\n",
+      "\r\n",
+      "LEAR\tking of Britain  (KING LEAR:)\r\n",
+      "\r\n",
+      "KING OF FRANCE:\r\n",
+      "\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "run('head data/kinglear.txt')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This is just a `txt` file - it contains lines of text.\n",
+    "Let's take a look at the other dataset."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 136,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> head data/SMSSpamCollection.csv\n",
+      "ham\tGo until jurong point, crazy.. Available only in bugis n great 
world la e buffet... Cine there got amore wat...\r\n",
+      "ham\tOk lar... Joking wif u oni...\r\n",
+      "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 
2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 
08452810075over18's\r\n",
+      "ham\tU dun say so early hor... U c already then say...\r\n",
+      "ham\tNah I don't think he goes to usf, he lives around here though\r\n",
+      "spam\tFreeMsg Hey there darling it's been 3 week's now and no word 
back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 
to rcv\r\n",
+      "ham\tEven my brother is not like to speak with me. They treat me like 
aids patent.\r\n",
+      "ham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu 
Vettam)' has been set as your callertune for all Callers. Press *9 to copy your 
friends Callertune\r\n",
+      "spam\tWINNER!! As a valued network customer you have been selected to 
receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 
12 hours only.\r\n",
+      "spam\tHad your mobile 11 months or more? U R entitled to Update to the 
latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 
08002986030\r\n",
+      "\n",
+      ">> wc -l data/SMSSpamCollection.csv\n",
+      "5574 data/SMSSpamCollection.csv\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "run('head data/SMSSpamCollection.csv')\n",
+    "run('wc -l data/SMSSpamCollection.csv')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This dataset is a `csv` file with 5,574 rows and 2 columns recording the 
following attributes separated by a tab sign:\n",
+    "1. `Column 1`: The label (either `ham` or `spam`)\n",
+    "2. `Column 2`: The SMS as raw text (type `string`)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Example 1: word count\n",
+    "In this popular introductory exercise, we will build a pipeline that 
reads lines of text from the input dataset `kinglear.txt` and counts the number 
of times each word appears in the text.\n",
+    "To start, we'll create a `yaml` file specifying our pipeline."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 153,
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "\n",
+    "    # Read input data. Each line from the csv file is a String.\n",
+    "    - type: ReadFromText\n",
+    "      name: InputText\n",
+    "      file_pattern: data/kinglear.txt\n",
+    "\n",
+    "    # Using a regex, we'll split the content of the message (one long 
string) into words (list of strings).\n",
+    "    - type: PyFlatMap\n",
+    "      name: FindWords\n",
+    "      fn: |\n",
+    "        import re\n",
+    "        lambda line: re.findall(r\"[a-zA-Z]+\", line)\n",
+    "\n",
+    "    # Transforming each word to lower case and combining it with a '1'. 
Result of this step are pairs (word: 1).\n",
+    "    - type: PyMap\n",
+    "      name: PairWordsWith1\n",
+    "      fn: 'lambda word: (word, 1)'\n",
+    "\n",
+    "    # Using SumPerKey transform, we'll calculate the occurrence of each 
word.\n",
+    "    - type: SumPerKey\n",
+    "      name: GroupAndSum\n",
+    "\n",
+    "    # Format results - each record should be represented as 'word: 
count'.\n",
+    "    # The 'fn' parameter accepts functions written in Python\n",
+    "    - type: PyMap\n",
+    "      name: FormatResults\n",
+    "      fn: \"lambda word_count_tuple: f'{word_count_tuple[0]}: 
{word_count_tuple[1]}'\"\n",
+    "\n",
+    "    # Save results to a text file.\n",
+    "    - type: WriteToText\n",
+    "      name: SaveToText\n",
+    "      file_path_prefix: \"data/result-pipeline-01\"\n",
+    "      file_name_suffix: \".txt\"\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipeline-01.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Let's tun the pipeline executing the Python script with the pipeline file 
as an argument:"

Review Comment:
   ```suggestion
       "Let's run the pipeline executing the Python script with the pipeline 
file as an argument:"
   ```



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,646 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - YAML\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it still has a high barrier for getting started and 
authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate.\n",
+    "\n",
+    "Here we provide a simple YAML syntax for describing pipelines that does 
not require coding experience or learning how to use an SDK&mdash;any text 
editor will do.\n",
+    "\n",
+    "Please note: YAML API is still EXPERIMENTAL and subject to change.\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a 
simple pipeline using YAML. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capatibility 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, you can click the **Run cell** button at the top left 
of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code 
cell and re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "Fz6KSQ13_3Rr",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment, which includes installing 
`apache-beam` and downloading a text file from Cloud Storage to your local file 
system. We are using this file to test your pipeline."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "GOOk81Jj_yUy",
+    "colab_type": "code",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+    "colab": {
+     "base_uri": "https://localhost:8080/";,
+     "height": 170
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "# Run and print a shell command.\n",
+    "def run(cmd):\n",
+    "  print('>> {}'.format(cmd))\n",
+    "  !{cmd}\n",
+    "  print('')\n",
+    "\n",
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "run('pip install --quiet apache-beam')\n",
+    "\n",
+    "# Copy the input files into the local file system.\n",
+    "run('mkdir -p data')\n",
+    "run('gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt 
data/kinglear.txt')\n",
+    "run('gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv')"
+   ],
+   "execution_count": 134,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> pip install --quiet apache-beam\n",
+      "\n",
+      ">> mkdir -p data\n",
+      "\n",
+      ">> gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt 
data/kinglear.txt\n",
+      "Copying gs://dataflow-samples/shakespeare/kinglear.txt...\r\n",
+      "/ [1 files][153.6 KiB/153.6 KiB]                                        
        \r\n",
+      "Operation completed over 1 objects/153.6 KiB.                           
         \r\n",
+      "\n",
+      ">> gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv\n",
+      "Copying 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection...\r\n",
+      "/ [1 files][466.7 KiB/466.7 KiB]                                        
        \r\n",
+      "Operation completed over 1 objects/466.7 KiB.                           
         \r\n",
+      "\n"
+     ]
+    }
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Inspect the data\n",
+    "We'll be working with 2 datasets. We'll use `kinglear.txt` for the first 
example - word count, and `SMSSpamCollection.csv` for the second and third.\n",
+    "Let's first take a loot at the `kinglear.txt` dataset."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 135,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> head data/kinglear.txt\n",
+      "\tKING LEAR\r\n",
+      "\r\n",
+      "\r\n",
+      "\tDRAMATIS PERSONAE\r\n",
+      "\r\n",
+      "\r\n",
+      "LEAR\tking of Britain  (KING LEAR:)\r\n",
+      "\r\n",
+      "KING OF FRANCE:\r\n",
+      "\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "run('head data/kinglear.txt')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This is just a `txt` file - it contains lines of text.\n",
+    "Let's take a look at the other dataset."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 136,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> head data/SMSSpamCollection.csv\n",
+      "ham\tGo until jurong point, crazy.. Available only in bugis n great 
world la e buffet... Cine there got amore wat...\r\n",
+      "ham\tOk lar... Joking wif u oni...\r\n",
+      "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 
2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 
08452810075over18's\r\n",
+      "ham\tU dun say so early hor... U c already then say...\r\n",
+      "ham\tNah I don't think he goes to usf, he lives around here though\r\n",
+      "spam\tFreeMsg Hey there darling it's been 3 week's now and no word 
back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 
to rcv\r\n",
+      "ham\tEven my brother is not like to speak with me. They treat me like 
aids patent.\r\n",
+      "ham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu 
Vettam)' has been set as your callertune for all Callers. Press *9 to copy your 
friends Callertune\r\n",
+      "spam\tWINNER!! As a valued network customer you have been selected to 
receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 
12 hours only.\r\n",
+      "spam\tHad your mobile 11 months or more? U R entitled to Update to the 
latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 
08002986030\r\n",
+      "\n",
+      ">> wc -l data/SMSSpamCollection.csv\n",
+      "5574 data/SMSSpamCollection.csv\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "run('head data/SMSSpamCollection.csv')\n",
+    "run('wc -l data/SMSSpamCollection.csv')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This dataset is a `csv` file with 5,574 rows and 2 columns recording the 
following attributes separated by a tab sign:\n",
+    "1. `Column 1`: The label (either `ham` or `spam`)\n",
+    "2. `Column 2`: The SMS as raw text (type `string`)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Example 1: word count\n",
+    "In this popular introductory exercise, we will build a pipeline that 
reads lines of text from the input dataset `kinglear.txt` and counts the number 
of times each word appears in the text.\n",
+    "To start, we'll create a `yaml` file specifying our pipeline."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 153,
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "\n",
+    "    # Read input data. Each line from the csv file is a String.\n",
+    "    - type: ReadFromText\n",
+    "      name: InputText\n",
+    "      file_pattern: data/kinglear.txt\n",
+    "\n",
+    "    # Using a regex, we'll split the content of the message (one long 
string) into words (list of strings).\n",
+    "    - type: PyFlatMap\n",
+    "      name: FindWords\n",
+    "      fn: |\n",
+    "        import re\n",
+    "        lambda line: re.findall(r\"[a-zA-Z]+\", line)\n",
+    "\n",
+    "    # Transforming each word to lower case and combining it with a '1'. 
Result of this step are pairs (word: 1).\n",
+    "    - type: PyMap\n",
+    "      name: PairWordsWith1\n",
+    "      fn: 'lambda word: (word, 1)'\n",
+    "\n",
+    "    # Using SumPerKey transform, we'll calculate the occurrence of each 
word.\n",
+    "    - type: SumPerKey\n",
+    "      name: GroupAndSum\n",
+    "\n",
+    "    # Format results - each record should be represented as 'word: 
count'.\n",
+    "    # The 'fn' parameter accepts functions written in Python\n",
+    "    - type: PyMap\n",
+    "      name: FormatResults\n",
+    "      fn: \"lambda word_count_tuple: f'{word_count_tuple[0]}: 
{word_count_tuple[1]}'\"\n",
+    "\n",
+    "    # Save results to a text file.\n",
+    "    - type: WriteToText\n",
+    "      name: SaveToText\n",
+    "      file_path_prefix: \"data/result-pipeline-01\"\n",
+    "      file_name_suffix: \".txt\"\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipeline-01.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Let's tun the pipeline executing the Python script with the pipeline file 
as an argument:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 154,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> python -m apache_beam.yaml.main 
--pipeline_spec_file=pipeline-01.yaml\n",
+      "Building pipeline...\r\n",
+      "INFO:apache_beam.yaml.yaml_transform:Expanding \"InputText\" at line 7 
\r\n",
+      "INFO:apache_beam.yaml.yaml_transform:Expanding \"FindWords\" at line 12 
\r\n",
+      "INFO:apache_beam.yaml.yaml_transform:Expanding \"PairWordsWith1\" at 
line 19 \r\n",
+      "INFO:apache_beam.yaml.yaml_transform:Expanding \"GroupAndSum\" at line 
24 \r\n",
+      "INFO:apache_beam.yaml.yaml_transform:Expanding \"FormatResults\" at 
line 28 \r\n",
+      "INFO:apache_beam.yaml.yaml_transform:Expanding \"SaveToText\" at line 
33 \r\n",
+      "Running pipeline...\r\n",
+      "WARNING:apache_beam.io.filebasedsink:Deleting 1 existing files in 
target path matching: -*-of-%(num_shards)05d\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "run('python -m apache_beam.yaml.main 
--pipeline_spec_file=pipeline-01.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Let's inspect the data. Each line contains a word and an associated 
count."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 155,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> head data/result-pipeline-01-00000-of-00001.txt\n",
+      "KING: 243\r\n",
+      "LEAR: 236\r\n",
+      "DRAMATIS: 1\r\n",
+      "PERSONAE: 1\r\n",
+      "king: 65\r\n",
+      "of: 447\r\n",
+      "Britain: 2\r\n",
+      "OF: 15\r\n",
+      "FRANCE: 10\r\n",
+      "DUKE: 3\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "run('head data/result-pipeline-01-00000-of-00001.txt')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Example 2: load data, filter unwanted lines and save results to a text 
file.\n",
+    "In this example, we'll create a pipeline which loads the data, filters 
out valid messages leaving spam, and saves only valid lines to a file.\n",
+    "Please note that this time we didn't specify the top-level type as 
`chain`. This allows for more flexibility when creating a pipeline, but 
requires us to specify the `input` explicitly for each transform.\n"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 138,
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "\n",
+    "  - type: ReadFromText\n",
+    "    name: SmsData\n",
+    "    file_pattern: data/SMSSpamCollection.csv\n",
+    "\n",
+    "  - type: PyMap\n",
+    "    name: SplitLine\n",
+    "    input: SmsData\n",
+    "    fn: 'lambda line: line.split(\"\\\\t\")'\n",
+    "\n",
+    "  - type: PyFilter\n",
+    "    name: KeepSpam\n",
+    "    input: SplitLine\n",
+    "    keep: 'lambda row: row[0] == \"spam\"'\n",
+    "\n",
+    "  - type: WriteToText\n",
+    "    name: SaveToText\n",
+    "    input: KeepSpam\n",
+    "    file_path_prefix: \"data/result-pipeline-01\"\n",
+    "    file_name_suffix: \".txt\"\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipeline-02.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Since this pipeline is linear, we can rewrite it to a `chain` and drop 
the `input`s in transforms."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "    - type: ReadFromText\n",
+    "      name: SmsData\n",
+    "      file_pattern: data/SMSSpamCollection.csv\n",
+    "\n",
+    "    - type: PyMap\n",
+    "      name: SplitLine\n",
+    "      fn: 'lambda line: line.split(\"\\\\t\")'\n",
+    "\n",
+    "    - type: PyFilter\n",
+    "      name: KeepSpam\n",
+    "      keep: 'lambda row: row[0] == \"spam\"'\n",
+    "\n",
+    "    - type: WriteToText\n",
+    "      name: SaveToText\n",
+    "      file_path_prefix: \"data/result-pipeline-02\"\n",
+    "      file_name_suffix: \".txt\"\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipeline-02-chain.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "To run the pipeline locally, using a DirectRunner, you need to run the 
yaml's main python script, passing the `pipeline-02-chain.yaml` (or 
`pipeline-02.yaml`) file as an input:"

Review Comment:
   ```suggestion
       "To run the pipeline locally, using a DirectRunner, you need to run the 
YAML's main python script, passing the `pipeline-02-chain.yaml` (or 
`pipeline-02.yaml`) file as an input:"
   ```



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - Yaml\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it often still has too high a barrier for getting started 
and authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate for some (though 
https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in 
making this easier).\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a 
simple pipeline using Yaml API. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capatibility 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, you can click the **Run cell** button at the top left 
of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code 
cell and re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "Fz6KSQ13_3Rr",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment, which includes installing 
`apache-beam` and downloading a text file from Cloud Storage to your local file 
system. We are using this file to test your pipeline."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "GOOk81Jj_yUy",
+    "colab_type": "code",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+    "colab": {
+     "base_uri": "https://localhost:8080/";,
+     "height": 170
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "# Run and print a shell command.\n",
+    "def run(cmd):\n",
+    "  print('>> {}'.format(cmd))\n",
+    "  !{cmd}\n",
+    "  print('')\n",
+    "\n",
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "run('pip install --quiet apache-beam')\n",
+    "\n",
+    "# Copy the input file into the local file system.\n",
+    "run('mkdir -p data')\n",
+    "run('gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv')"
+   ],
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Inspect the data\n",
+    "Let’s see how our data looks like."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('head data/SMSSpamCollection.csv')\n",
+    "run('wc -l data/SMSSpamCollection.csv')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This dataset is a `csv` file with 5,574 rows and 2 columns recording the 
following attributes separated by a tab sign:\n",
+    "1. `Column 1`: The label (either `ham` or `spam`)\n",
+    "2. `Column 2`: The SMS as raw text (type `string`)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## First pipeline\n",
+    "We’ll start with creating a pipeline which loads the data, filters out 
valid messages leaving spam, and saves only valid lines to a file.\n"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  - type: ReadFromText\n",
+    "    name: SmsData\n",
+    "    file_pattern: data/SMSSpamCollection.csv\n",
+    "\n",
+    "  - type: PyMap\n",
+    "    name: SplitLine\n",
+    "    input: SmsData\n",
+    "    fn: 'lambda line: line.split(\"\\\\t\")'\n",
+    "\n",
+    "  - type: PyFilter\n",
+    "    name: KeepSpam\n",
+    "    input: SplitLine\n",
+    "    keep: 'lambda row: row[0] == \"spam\"'\n",
+    "\n",
+    "  - type: WriteToText\n",
+    "    name: SaveToText\n",
+    "    input: KeepSpam\n",
+    "    file_path_prefix: \"data/result-pipeline-01\"\n",
+    "    file_name_suffix: \".txt\"\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipeline-01.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "In this example, each transformation contains the 'input' key, but if the 
pipeline is linear, such as ours, we can let the inputs be implicit by 
designating the pipeline as a `chain` type.\n"

Review Comment:
   I would rather keep the inputs and remove "chain" entirely since it's more 
advanced syntax.



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,646 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - YAML\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it still has a high barrier for getting started and 
authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate.\n",
+    "\n",
+    "Here we provide a simple YAML syntax for describing pipelines that does 
not require coding experience or learning how to use an SDK&mdash;any text 
editor will do.\n",
+    "\n",
+    "Please note: YAML API is still EXPERIMENTAL and subject to change.\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a 
simple pipeline using YAML. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capatibility 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, you can click the **Run cell** button at the top left 
of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code 
cell and re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "Fz6KSQ13_3Rr",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment, which includes installing 
`apache-beam` and downloading a text file from Cloud Storage to your local file 
system. We are using this file to test your pipeline."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "GOOk81Jj_yUy",
+    "colab_type": "code",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+    "colab": {
+     "base_uri": "https://localhost:8080/";,
+     "height": 170
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "# Run and print a shell command.\n",
+    "def run(cmd):\n",
+    "  print('>> {}'.format(cmd))\n",
+    "  !{cmd}\n",
+    "  print('')\n",
+    "\n",
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "run('pip install --quiet apache-beam')\n",
+    "\n",
+    "# Copy the input files into the local file system.\n",
+    "run('mkdir -p data')\n",
+    "run('gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt 
data/kinglear.txt')\n",
+    "run('gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv')"
+   ],
+   "execution_count": 134,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> pip install --quiet apache-beam\n",
+      "\n",
+      ">> mkdir -p data\n",
+      "\n",
+      ">> gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt 
data/kinglear.txt\n",
+      "Copying gs://dataflow-samples/shakespeare/kinglear.txt...\r\n",
+      "/ [1 files][153.6 KiB/153.6 KiB]                                        
        \r\n",
+      "Operation completed over 1 objects/153.6 KiB.                           
         \r\n",
+      "\n",
+      ">> gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv\n",
+      "Copying 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection...\r\n",
+      "/ [1 files][466.7 KiB/466.7 KiB]                                        
        \r\n",
+      "Operation completed over 1 objects/466.7 KiB.                           
         \r\n",
+      "\n"
+     ]
+    }
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Inspect the data\n",
+    "We'll be working with 2 datasets. We'll use `kinglear.txt` for the first 
example - word count, and `SMSSpamCollection.csv` for the second and third.\n",
+    "Let's first take a loot at the `kinglear.txt` dataset."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 135,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> head data/kinglear.txt\n",
+      "\tKING LEAR\r\n",
+      "\r\n",
+      "\r\n",
+      "\tDRAMATIS PERSONAE\r\n",
+      "\r\n",
+      "\r\n",
+      "LEAR\tking of Britain  (KING LEAR:)\r\n",
+      "\r\n",
+      "KING OF FRANCE:\r\n",
+      "\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "run('head data/kinglear.txt')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This is just a `txt` file - it contains lines of text.\n",
+    "Let's take a look at the other dataset."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 136,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> head data/SMSSpamCollection.csv\n",
+      "ham\tGo until jurong point, crazy.. Available only in bugis n great 
world la e buffet... Cine there got amore wat...\r\n",
+      "ham\tOk lar... Joking wif u oni...\r\n",
+      "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 
2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 
08452810075over18's\r\n",
+      "ham\tU dun say so early hor... U c already then say...\r\n",
+      "ham\tNah I don't think he goes to usf, he lives around here though\r\n",
+      "spam\tFreeMsg Hey there darling it's been 3 week's now and no word 
back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 
to rcv\r\n",
+      "ham\tEven my brother is not like to speak with me. They treat me like 
aids patent.\r\n",
+      "ham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu 
Vettam)' has been set as your callertune for all Callers. Press *9 to copy your 
friends Callertune\r\n",
+      "spam\tWINNER!! As a valued network customer you have been selected to 
receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 
12 hours only.\r\n",
+      "spam\tHad your mobile 11 months or more? U R entitled to Update to the 
latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 
08002986030\r\n",
+      "\n",
+      ">> wc -l data/SMSSpamCollection.csv\n",
+      "5574 data/SMSSpamCollection.csv\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "run('head data/SMSSpamCollection.csv')\n",
+    "run('wc -l data/SMSSpamCollection.csv')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This dataset is a `csv` file with 5,574 rows and 2 columns recording the 
following attributes separated by a tab sign:\n",
+    "1. `Column 1`: The label (either `ham` or `spam`)\n",
+    "2. `Column 2`: The SMS as raw text (type `string`)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Example 1: word count\n",

Review Comment:
   Here it may be good to do really basic syntax introduction, similar to 
"structure" section in https://s.apache.org/beam-yaml-pipelines-syntax



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,646 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - YAML\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it still has a high barrier for getting started and 
authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate.\n",
+    "\n",
+    "Here we provide a simple YAML syntax for describing pipelines that does 
not require coding experience or learning how to use an SDK&mdash;any text 
editor will do.\n",
+    "\n",
+    "Please note: YAML API is still EXPERIMENTAL and subject to change.\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a 
simple pipeline using YAML. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capatibility 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, you can click the **Run cell** button at the top left 
of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code 
cell and re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "Fz6KSQ13_3Rr",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment, which includes installing 
`apache-beam` and downloading a text file from Cloud Storage to your local file 
system. We are using this file to test your pipeline."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "GOOk81Jj_yUy",
+    "colab_type": "code",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+    "colab": {
+     "base_uri": "https://localhost:8080/";,
+     "height": 170
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "# Run and print a shell command.\n",
+    "def run(cmd):\n",
+    "  print('>> {}'.format(cmd))\n",
+    "  !{cmd}\n",
+    "  print('')\n",
+    "\n",
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "run('pip install --quiet apache-beam')\n",
+    "\n",
+    "# Copy the input files into the local file system.\n",
+    "run('mkdir -p data')\n",
+    "run('gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt 
data/kinglear.txt')\n",
+    "run('gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv')"
+   ],
+   "execution_count": 134,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> pip install --quiet apache-beam\n",
+      "\n",
+      ">> mkdir -p data\n",
+      "\n",
+      ">> gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt 
data/kinglear.txt\n",
+      "Copying gs://dataflow-samples/shakespeare/kinglear.txt...\r\n",
+      "/ [1 files][153.6 KiB/153.6 KiB]                                        
        \r\n",
+      "Operation completed over 1 objects/153.6 KiB.                           
         \r\n",
+      "\n",
+      ">> gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv\n",
+      "Copying 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection...\r\n",
+      "/ [1 files][466.7 KiB/466.7 KiB]                                        
        \r\n",
+      "Operation completed over 1 objects/466.7 KiB.                           
         \r\n",
+      "\n"
+     ]
+    }
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Inspect the data\n",
+    "We'll be working with 2 datasets. We'll use `kinglear.txt` for the first 
example - word count, and `SMSSpamCollection.csv` for the second and third.\n",
+    "Let's first take a loot at the `kinglear.txt` dataset."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 135,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> head data/kinglear.txt\n",
+      "\tKING LEAR\r\n",
+      "\r\n",
+      "\r\n",
+      "\tDRAMATIS PERSONAE\r\n",
+      "\r\n",
+      "\r\n",
+      "LEAR\tking of Britain  (KING LEAR:)\r\n",
+      "\r\n",
+      "KING OF FRANCE:\r\n",
+      "\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "run('head data/kinglear.txt')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This is just a `txt` file - it contains lines of text.\n",
+    "Let's take a look at the other dataset."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 136,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">> head data/SMSSpamCollection.csv\n",
+      "ham\tGo until jurong point, crazy.. Available only in bugis n great 
world la e buffet... Cine there got amore wat...\r\n",
+      "ham\tOk lar... Joking wif u oni...\r\n",
+      "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 
2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 
08452810075over18's\r\n",
+      "ham\tU dun say so early hor... U c already then say...\r\n",
+      "ham\tNah I don't think he goes to usf, he lives around here though\r\n",
+      "spam\tFreeMsg Hey there darling it's been 3 week's now and no word 
back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 
to rcv\r\n",
+      "ham\tEven my brother is not like to speak with me. They treat me like 
aids patent.\r\n",
+      "ham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu 
Vettam)' has been set as your callertune for all Callers. Press *9 to copy your 
friends Callertune\r\n",
+      "spam\tWINNER!! As a valued network customer you have been selected to 
receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 
12 hours only.\r\n",
+      "spam\tHad your mobile 11 months or more? U R entitled to Update to the 
latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 
08002986030\r\n",
+      "\n",
+      ">> wc -l data/SMSSpamCollection.csv\n",
+      "5574 data/SMSSpamCollection.csv\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "run('head data/SMSSpamCollection.csv')\n",
+    "run('wc -l data/SMSSpamCollection.csv')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This dataset is a `csv` file with 5,574 rows and 2 columns recording the 
following attributes separated by a tab sign:\n",
+    "1. `Column 1`: The label (either `ham` or `spam`)\n",
+    "2. `Column 2`: The SMS as raw text (type `string`)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Example 1: word count\n",
+    "In this popular introductory exercise, we will build a pipeline that 
reads lines of text from the input dataset `kinglear.txt` and counts the number 
of times each word appears in the text.\n",
+    "To start, we'll create a `yaml` file specifying our pipeline."

Review Comment:
   Or maybe .yaml?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] amotley commented on a diff in pull request #27284: Yaml API: Day Zero tutorial notebook

Reply via email to