[GitHub] [beam] amotley commented on a diff in pull request #27284: Yaml API: Day Zero tutorial notebook

via GitHub Wed, 28 Jun 2023 13:23:57 -0700


amotley commented on code in PR #27284:
URL: https://github.com/apache/beam/pull/27284#discussion_r1245700773



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{

Review Comment:
   Can we add a landing page (similar to 
https://beam.apache.org/get-started/try-apache-beam/) that links to this 
Colab page?



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - Yaml\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it often still has too high a barrier for getting started 
and authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate for some (though 
https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in 
making this easier).\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",

Review Comment:
   ```suggestion
       "Here we provide a simple YAML syntax for describing pipelines that does 
not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
   ```
   
   
   Let's say "YAML syntax" instead of "declarative syntax". There have been 
some discussions on branding and right now we're not sure declarative is the 
right description.



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - Yaml\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it often still has too high a barrier for getting started 
and authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate for some (though 
https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in 
making this easier).\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a 
simple pipeline using Yaml API. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capatibility 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",

Review Comment:
   ```suggestion
       "In this notebook, we set up your development environment and write a 
simple pipeline using YAML. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capability 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
   ```



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - Yaml\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it often still has too high a barrier for getting started 
and authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate for some (though 
https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in 
making this easier).\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a 
simple pipeline using Yaml API. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capatibility 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, you can click the **Run cell** button at the top left 
of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code 
cell and re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "Fz6KSQ13_3Rr",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment, which includes installing 
`apache-beam` and downloading a text file from Cloud Storage to your local file 
system. We are using this file to test your pipeline."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "GOOk81Jj_yUy",
+    "colab_type": "code",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 170
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "# Run and print a shell command.\n",
+    "def run(cmd):\n",
+    "  print('>> {}'.format(cmd))\n",
+    "  !{cmd}\n",
+    "  print('')\n",
+    "\n",
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "run('pip install --quiet apache-beam')\n",
+    "\n",
+    "# Copy the input file into the local file system.\n",
+    "run('mkdir -p data')\n",
+    "run('gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv')"
+   ],
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Inspect the data\n",
+    "Let’s see how our data looks like."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('head data/SMSSpamCollection.csv')\n",
+    "run('wc -l data/SMSSpamCollection.csv')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This dataset is a `csv` file with 5,574 rows and 2 columns recording the 
following attributes separated by a tab sign:\n",
+    "1. `Column 1`: The label (either `ham` or `spam`)\n",
+    "2. `Column 2`: The SMS as raw text (type `string`)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## First pipeline\n",
+    "We’ll start with creating a pipeline which loads the data, filters out 
valid messages leaving spam, and saves only valid lines to a file.\n"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  - type: ReadFromText\n",
+    "    name: SmsData\n",
+    "    file_pattern: data/SMSSpamCollection.csv\n",
+    "\n",
+    "  - type: PyMap\n",
+    "    name: SplitLine\n",
+    "    input: SmsData\n",
+    "    fn: 'lambda line: line.split(\"\\\\t\")'\n",
+    "\n",
+    "  - type: PyFilter\n",
+    "    name: KeepSpam\n",
+    "    input: SplitLine\n",
+    "    keep: 'lambda row: row[0] == \"spam\"'\n",
+    "\n",
+    "  - type: WriteToText\n",
+    "    name: SaveToText\n",
+    "    input: KeepSpam\n",
+    "    file_path_prefix: \"data/result-pipeline-01\"\n",
+    "    file_name_suffix: \".txt\"\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipeline-01.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "In this example, each transformation contains the 'input' key, but if the 
pipeline is linear, such as ours, we can let the inputs be implicit by 
designating the pipeline as a `chain` type.\n"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "    - type: ReadFromText\n",
+    "      name: SmsData\n",
+    "      file_pattern: data/SMSSpamCollection.csv\n",
+    "\n",
+    "    - type: PyMap\n",
+    "      name: SplitLine\n",
+    "      fn: 'lambda line: line.split(\"\\\\t\")'\n",
+    "\n",
+    "    - type: PyFilter\n",
+    "      name: KeepSpam\n",
+    "      keep: 'lambda row: row[0] == \"spam\"'\n",
+    "\n",
+    "    - type: WriteToText\n",
+    "      name: SaveToText\n",
+    "      file_path_prefix: \"data/result-pipeline-01\"\n",
+    "      file_name_suffix: \".txt\"\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipeline-01-chain.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "To run the pipeline locally, using a DirectRunner, you need to run the 
yaml's main python script, passing the `pipeline-01-chain.yaml` (or 
`pipeline-01.yaml`) file as an input:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('python -m apache_beam.yaml.main 
--pipeline_spec_file=pipeline-01-chain.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Let's verify the results and see the content of the output file."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('head data/result-pipeline-01-00000-of-00001.txt')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "If everything went well, you should see only spam messages from our input 
dataset. Congratulations, onto the next example!\n"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Count words in spam messages, select top 10 popular words and write 
results to a file\n",

Review Comment:
   Could this heading be shortened to something like "Example 2: <description>"?
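
For reference, the `SplitLine` and `KeepSpam` steps in the first pipeline of this hunk reduce to two lambdas applied to each line. A minimal plain-Python sketch of that logic follows. The sample lines are made up for illustration (they are not taken from the dataset), and this only mirrors the per-element logic, not how Beam actually executes the pipeline.

```python
# Hypothetical sample lines in the same tab-separated layout as the dataset.
sample_lines = [
    "ham\tSee you at lunch",
    "spam\tWin a free prize now",
    "ham\tCall me later",
]

# The same lambdas that the YAML spec passes to PyMap and PyFilter.
split_line = lambda line: line.split("\t")
keep_spam = lambda row: row[0] == "spam"

# Map every line to [label, text], then keep only rows labelled "spam".
rows = [split_line(line) for line in sample_lines]
spam_rows = [row for row in rows if keep_spam(row)]
print(spam_rows)
```

Only rows whose first field is `spam` survive the filter, which is what the output file is expected to contain.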



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - Yaml\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it often still has too high a barrier for getting started 
and authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate for some (though 
https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in 
making this easier).\n",

Review Comment:
   ```suggestion
       "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it often still has too high a barrier for getting started 
and authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate.\n",
   ```
   



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - Yaml\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it often still has too high a barrier for getting started 
and authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate for some (though 
https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in 
making this easier).\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",

Review Comment:
   ```suggestion
       "Here we provide a simple declarative syntax for describing pipelines 
that does not require coding experience or learning how to use an SDK&mdash;any 
text editor will do.\n",
       
   ```
   
   Let's remove the forward-looking sentences, since we don't have these capabilities yet.



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - Yaml\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it often still has too high a barrier for getting started 
and authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate for some (though 
https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in 
making this easier).\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a 
simple pipeline using Yaml API. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capatibility 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, you can click the **Run cell** button at the top left 
of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code 
cell and re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "Fz6KSQ13_3Rr",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment, which includes installing 
`apache-beam` and downloading a text file from Cloud Storage to your local file 
system. We are using this file to test your pipeline."

Review Comment:
   Can we add a pointer to some troubleshooting help if this doesn't work?
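
One possible shape for such a pointer is a small check cell that reports what is missing before the reader moves on. This is a minimal sketch, assuming the setup cell has already been run. The helper name `check_setup` is hypothetical; only the `apache_beam` module name and the `data/SMSSpamCollection.csv` path come from the notebook.

```python
import importlib.util
import os


def check_setup(data_path="data/SMSSpamCollection.csv"):
    """Return a list of human-readable setup problems (empty if none found)."""
    problems = []
    # find_spec returns None when a top-level package cannot be imported.
    if importlib.util.find_spec("apache_beam") is None:
        problems.append(
            "apache-beam is not installed; re-run the pip install step.")
    # The pipeline examples read this file, so it must exist locally.
    if not os.path.isfile(data_path):
        problems.append(
            "{} is missing; re-run the gsutil cp step.".format(data_path))
    return problems


for problem in check_setup():
    print("Setup problem:", problem)
```

A cell like this could sit next to a link to the relevant help page, so that readers see a concrete message rather than a stack trace from a later step.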



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{

Review Comment:
   It would be good to get someone from the Tech Writer team to review too. You 
can check with rszper@.



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - Yaml\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it often still has too high a barrier for getting started 
and authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate for some (though 
https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in 
making this easier).\n",

Review Comment:
   Let's add a disclaimer somewhere that this is still EXPERIMENTAL.



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\"
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - Yaml\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it often still has too high a barrier for getting started 
and authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate for some (though 
https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in 
making this easier).\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a 
simple pipeline using Yaml API. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capatibility 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, you can click the **Run cell** button at the top left 
of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code 
cell and re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "Fz6KSQ13_3Rr",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment, which includes installing 
`apache-beam` and downloading a text file from Cloud Storage to your local file 
system. We are using this file to test your pipeline."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "GOOk81Jj_yUy",
+    "colab_type": "code",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 170
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "# Run and print a shell command.\n",
+    "def run(cmd):\n",
+    "  print('>> {}'.format(cmd))\n",
+    "  !{cmd}\n",
+    "  print('')\n",
+    "\n",
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "run('pip install --quiet apache-beam')\n",
+    "\n",
+    "# Copy the input file into the local file system.\n",
+    "run('mkdir -p data')\n",
+    "run('gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv')"
+   ],
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Inspect the data\n",
+    "Let’s see how our data looks like."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('head data/SMSSpamCollection.csv')\n",
+    "run('wc -l data/SMSSpamCollection.csv')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This dataset is a `csv` file with 5,574 rows and 2 columns recording the 
following attributes separated by a tab sign:\n",
+    "1. `Column 1`: The label (either `ham` or `spam`)\n",
+    "2. `Column 2`: The SMS as raw text (type `string`)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## First pipeline\n",

Review Comment:
   Maybe something like "Example 1: <description of what pipeline does>"



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\"
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - Yaml\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it often still has too high a barrier for getting started 
and authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate for some (though 
https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in 
making this easier).\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a 
simple pipeline using Yaml API. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capatibility 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, you can click the **Run cell** button at the top left 
of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code 
cell and re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "Fz6KSQ13_3Rr",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment, which includes installing 
`apache-beam` and downloading a text file from Cloud Storage to your local file 
system. We are using this file to test your pipeline."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "GOOk81Jj_yUy",
+    "colab_type": "code",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 170
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "# Run and print a shell command.\n",
+    "def run(cmd):\n",
+    "  print('>> {}'.format(cmd))\n",
+    "  !{cmd}\n",
+    "  print('')\n",
+    "\n",
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "run('pip install --quiet apache-beam')\n",
+    "\n",
+    "# Copy the input file into the local file system.\n",
+    "run('mkdir -p data')\n",
+    "run('gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv')"
+   ],
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Inspect the data\n",
+    "Let’s see how our data looks like."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('head data/SMSSpamCollection.csv')\n",
+    "run('wc -l data/SMSSpamCollection.csv')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This dataset is a `csv` file with 5,574 rows and 2 columns recording the 
following attributes separated by a tab sign:\n",
+    "1. `Column 1`: The label (either `ham` or `spam`)\n",
+    "2. `Column 2`: The SMS as raw text (type `string`)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## First pipeline\n",
+    "We’ll start with creating a pipeline which loads the data, filters out 
valid messages leaving spam, and saves only valid lines to a file.\n"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  - type: ReadFromText\n",
+    "    name: SmsData\n",
+    "    file_pattern: data/SMSSpamCollection.csv\n",
+    "\n",
+    "  - type: PyMap\n",
+    "    name: SplitLine\n",
+    "    input: SmsData\n",
+    "    fn: 'lambda line: line.split(\"\\\\t\")'\n",
+    "\n",
+    "  - type: PyFilter\n",
+    "    name: KeepSpam\n",
+    "    input: SplitLine\n",
+    "    keep: 'lambda row: row[0] == \"spam\"'\n",
+    "\n",
+    "  - type: WriteToText\n",
+    "    name: SaveToText\n",
+    "    input: KeepSpam\n",
+    "    file_path_prefix: \"data/result-pipeline-01\"\n",
+    "    file_name_suffix: \".txt\"\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipeline-01.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "In this example, each transformation contains the 'input' key, but if the 
pipeline is linear, such as ours, we can let the inputs be implicit by 
designating the pipeline as a `chain` type.\n"

Review Comment:
   Can we remove 'chain' from this tutorial? It could be unnecessary overhead 
for new YAML users, and we can cover it in the advanced section of the 
documentation instead.



##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,424 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\"
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In 
Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 
2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - Yaml\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data 
processing pipelines, it often still has too high a barrier for getting started 
and authoring simple pipelines. Even setting up the environment, installing the 
dependencies, and setting up the project can be an overwhelming amount of 
boilerplate for some (though 
https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in 
making this easier).\n",
+    "\n",
+    "Here we provide a simple declarative syntax for describing pipelines that 
does not require coding experience or learning how to use an SDK&mdash;any text 
editor will do. Some installation may be required to actually *execute* a 
pipeline, but we envision various services (such as Dataflow) to accept yaml 
pipelines directly obviating the need for even that in the future. We also 
anticipate the ability to generate code directly from these higher-level yaml 
descriptions, should one want to graduate to a full Beam SDK (and possibly the 
other direction as well as far as possible).\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a 
simple pipeline using Yaml API. We'll run it locally, using the 
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can 
explore other runners with the [Beam Capatibility 
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From 
**View**  drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, you can click the **Run cell** button at the top left 
of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code 
cell and re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to 
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "Fz6KSQ13_3Rr",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment, which includes installing 
`apache-beam` and downloading a text file from Cloud Storage to your local file 
system. We are using this file to test your pipeline."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "GOOk81Jj_yUy",
+    "colab_type": "code",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 170
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "# Run and print a shell command.\n",
+    "def run(cmd):\n",
+    "  print('>> {}'.format(cmd))\n",
+    "  !{cmd}\n",
+    "  print('')\n",
+    "\n",
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "run('pip install --quiet apache-beam')\n",
+    "\n",
+    "# Copy the input file into the local file system.\n",
+    "run('mkdir -p data')\n",
+    "run('gsutil cp 
gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection 
data/SMSSpamCollection.csv')"
+   ],
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Inspect the data\n",
+    "Let’s see how our data looks like."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('head data/SMSSpamCollection.csv')\n",
+    "run('wc -l data/SMSSpamCollection.csv')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This dataset is a `csv` file with 5,574 rows and 2 columns recording the 
following attributes separated by a tab sign:\n",
+    "1. `Column 1`: The label (either `ham` or `spam`)\n",
+    "2. `Column 2`: The SMS as raw text (type `string`)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## First pipeline\n",
+    "We’ll start with creating a pipeline which loads the data, filters out 
valid messages leaving spam, and saves only valid lines to a file.\n"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  - type: ReadFromText\n",
+    "    name: SmsData\n",
+    "    file_pattern: data/SMSSpamCollection.csv\n",
+    "\n",
+    "  - type: PyMap\n",
+    "    name: SplitLine\n",
+    "    input: SmsData\n",
+    "    fn: 'lambda line: line.split(\"\\\\t\")'\n",
+    "\n",
+    "  - type: PyFilter\n",
+    "    name: KeepSpam\n",
+    "    input: SplitLine\n",
+    "    keep: 'lambda row: row[0] == \"spam\"'\n",
+    "\n",
+    "  - type: WriteToText\n",
+    "    name: SaveToText\n",
+    "    input: KeepSpam\n",
+    "    file_path_prefix: \"data/result-pipeline-01\"\n",
+    "    file_name_suffix: \".txt\"\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipeline-01.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "In this example, each transformation contains the 'input' key, but if the 
pipeline is linear, such as ours, we can let the inputs be implicit by 
designating the pipeline as a `chain` type.\n"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "  transforms:\n",
+    "    - type: ReadFromText\n",
+    "      name: SmsData\n",
+    "      file_pattern: data/SMSSpamCollection.csv\n",
+    "\n",
+    "    - type: PyMap\n",
+    "      name: SplitLine\n",
+    "      fn: 'lambda line: line.split(\"\\\\t\")'\n",
+    "\n",
+    "    - type: PyFilter\n",
+    "      name: KeepSpam\n",
+    "      keep: 'lambda row: row[0] == \"spam\"'\n",
+    "\n",
+    "    - type: WriteToText\n",
+    "      name: SaveToText\n",
+    "      file_path_prefix: \"data/result-pipeline-01\"\n",
+    "      file_name_suffix: \".txt\"\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipeline-01-chain.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "To run the pipeline locally, using a DirectRunner, you need to run the 
yaml's main python script, passing the `pipeline-01-chain.yaml` (or 
`pipeline-01.yaml`) file as an input:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('python -m apache_beam.yaml.main 
--pipeline_spec_file=pipeline-01-chain.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Let's verify the results and see the content of the output file."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('head data/result-pipeline-01-00000-of-00001.txt')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "If everything went well, you should see only spam messages from our input 
dataset. Congratulations, onto the next example!\n"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Count words in spam messages, select top 10 popular words and write 
results to a file\n",
+    "\n",
+    "We'd like to write a pipeline which counts words occurring in spam 
messages, selects the most popular ones and writes the result to a file.\n"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",
+    "pipeline:\n",
+    "  type: chain\n",
+    "\n",
+    "  transforms:\n",
+    "    # Read input data. Each line from the csv file is a String.\n",
+    "    - type: ReadFromText\n",
+    "      name: SmsData\n",
+    "      file_pattern: data/SMSSpamCollection.csv\n",
+    "\n",
+    "    # Split each line into an array, where the first element is message 
label (ham or spam) and the second is the content of the message.\n",
+    "    - type: PyMap\n",
+    "      name: SplitLine\n",
+    "      fn: 'lambda line: line.split(\"\\\\t\")'\n",
+    "\n",
+    "    # Select only the rows that contain spam messages, based on the 
label.\n",
+    "    - type: PyFilter\n",
+    "      name: SpamMessages\n",
+    "      keep: 'lambda row: row[0] == \"spam\"'\n",
+    "\n",
+    "    # Using a regex, we'll split the content of the message (one long 
string) into words (list of strings)\n",
+    "    - type: PyFlatMap\n",
+    "      name: FindWords\n",
+    "      fn: |\n",
+    "        import re\n",
+    "        lambda line: re.findall(r\"[a-zA-Z]+\", line[1])\n",
+    "\n",
+    "    # Transforming each word to lower case and combining it with a '1'. 
Result of this step are pairs (word: 1).\n",
+    "    - type: PyMap\n",
+    "      name: PairLoweredWordsWith1\n",
+    "      fn: 'lambda word: (word.lower(), 1)'\n",
+    "\n",
+    "    # Using SumPerKey transform, we'll calculate the occurence of each 
word.\n",
+    "    - type: SumPerKey\n",
+    "      name: GroupAndSum\n",
+    "\n",
+    "    # Select 10 most popular words. Input format to this step is a tuple 
(word: count),\n",
+    "    # so we provide the count (row[1]) as the key to compare the 
numbers\n",
+    "    - type: TopNLargest\n",
+    "      name: Largest\n",
+    "      n: 10\n",
+    "      key: 'lambda row: row[1]'\n",
+    "\n",
+    "    # Save results to a text file.\n",
+    "    - type: WriteToText\n",
+    "      name: SaveToText\n",
+    "      file_path_prefix: \"data/result-pipeline-02\"\n",
+    "      file_name_suffix: \".txt\"\n",
+    "'''\n",
+    "save_to_file(pipeline, 'pipeline-02.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Let's run the pipeline:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('python -m apache_beam.yaml.main 
--pipeline_spec_file=pipeline-02.yaml')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "To view the output:"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('head data/result-pipeline-02-00000-of-00001.txt')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Summary\n",

Review Comment:
   If it's not too much extra work, it may be good to include a WordCount 
example so that folks can compare the YAML version with the other SDKs (the 
other SDK quick starts use word count).
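
   A minimal sketch of what such a WordCount cell might look like, assuming the 
transform names already used in this diff (`ReadFromText`, `PyFlatMap`, `PyMap`, 
`SumPerKey`, `WriteToText`) and the explicit `input:` form from the first 
example. The input path, output prefix, transform names, and YAML file name are 
placeholders, and the spec has not been run against the experimental YAML API.

```python
# Hypothetical WordCount cell in the style of the notebook's other cells.
# Transform types are copied from the diff under review; the paths, transform
# names, and the output file name are placeholders, not part of the PR.
wordcount_pipeline = '''
pipeline:
  - type: ReadFromText
    name: ReadLines
    file_pattern: data/input.txt

  - type: PyFlatMap
    name: FindWords
    input: ReadLines
    fn: |
      import re
      lambda line: re.findall(r"[a-zA-Z]+", line)

  - type: PyMap
    name: PairWordsWith1
    input: FindWords
    fn: 'lambda word: (word, 1)'

  - type: SumPerKey
    name: GroupAndSum
    input: PairWordsWith1

  - type: WriteToText
    name: SaveToText
    input: GroupAndSum
    file_path_prefix: "data/result-wordcount"
    file_name_suffix: ".txt"
'''

# Write the spec to disk so it can be passed to the YAML runner script.
with open('pipeline-wordcount.yaml', 'w') as f:
    f.write(wordcount_pipeline)
```

   If the spec is accepted as written, it would presumably be run like the 
other cells: `python -m apache_beam.yaml.main 
--pipeline_spec_file=pipeline-wordcount.yaml`.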


