bzablocki commented on code in PR #27284:
URL: https://github.com/apache/beam/pull/27284#discussion_r1430155626
##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,581 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Try Apache Beam - Python",
+ "version": "0.3.2",
+ "provenance": [],
+ "collapsed_sections": [],
+ "toc_visible": true,
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python2",
+ "display_name": "Python 2"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "source": [
+ "#@title ###### Licensed to the Apache Software Foundation (ASF), Version
2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License."
+ ],
+ "outputs": [],
+ "metadata": {
+ "cellView": "form"
+ }
+ },
+ {
+ "metadata": {
+ "id": "lNKIMlEDZ_Vw",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Try Apache Beam - YAML\n",
+ "\n",
+ "While Beam provides powerful APIs for authoring sophisticated data
processing pipelines, it still has a high barrier for getting started and
authoring simple pipelines. Even setting up the environment, installing the
dependencies, and setting up the project can be an overwhelming amount of
boilerplate.\n",
+ "\n",
+ "Here we provide a simple YAML syntax for describing pipelines that does
not require coding experience or learning how to use an SDK—any text
editor will do.\n",
+ "\n",
+ "Please note: YAML API is still EXPERIMENTAL and subject to change.\n",
+ "\n",
+ "In this notebook, we set up your development environment and write a
simple pipeline using YAML. We'll run it locally, using the
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can
explore other runners with the [Beam Capatibility
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+ "\n",
+ "To navigate through different sections, use the table of contents. From
**View** drop-down list, select **Table of contents**.\n",
+ "\n",
+ "To run a code cell, you can click the **Run cell** button at the top left
of the cell, or by select it and press **`Shift+Enter`**. Try modifying a code
cell and re-running it to see what happens.\n",
+ "\n",
+ "To learn more about Colab, see [Welcome to
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "Fz6KSQ13_3Rr",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Setup\n",
+ "\n",
+ "First, you need to set up your environment, which includes installing
`apache-beam` and downloading files from Cloud Storage to your local file
system. We'll use these files as an input to the pipelines in this guide."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "GOOk81Jj_yUy",
+ "colab_type": "code",
+ "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 170
+ }
+ },
+ "cell_type": "code",
+ "source": [
+ "# Run and print a shell command.\n",
+ "def run(cmd):\n",
+ " print('>> {}'.format(cmd))\n",
+ " !{cmd}\n",
+ " print('')\n",
+ "\n",
+ "def save_to_file(content, file_name):\n",
+ " with open(file_name, 'w') as f:\n",
+ " f.write(content)\n",
+ "\n",
+ "# Install apache-beam.\n",
+ "run('pip install --quiet apache-beam')\n",
+ "\n",
+ "# Copy the input files into the local file system.\n",
+ "run('mkdir -p data')\n",
+ "run('wget -O data/kinglear.txt
https://storage.googleapis.com/dataflow-samples/shakespeare/kinglear.txt')\n",
+ "run('wget -O data/SMSSpamCollection.csv
https://storage.googleapis.com/apache-beam-samples/SMSSpamCollection/SMSSpamCollection')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Inspect the data\n",
+ "We'll be working with 2 datasets. We'll use `kinglear.txt` for the first
example - word count, and `SMSSpamCollection.csv` for the second and third.\n",
+ "Let's first take a loot at the `kinglear.txt` dataset."
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "run('head data/kinglear.txt')"
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This is just a `txt` file - it contains lines of text.\n",
+ "Let's take a look at the other dataset."
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "run('head data/SMSSpamCollection.csv')\n",
+ "run('wc -l data/SMSSpamCollection.csv')"
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This dataset is a `csv` file with 5,574 rows and 2 columns recording the
following attributes separated by a tab sign:\n",
+ "1. `Column 1`: The label (either `ham` or `spam`)\n",
+ "2. `Column 2`: The SMS as raw text (type `string`)"
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Example 1: word count\n",
+ "In this popular introductory exercise, we will build a pipeline that
reads lines of text from the input dataset `kinglear.txt` and counts the number
of times each word appears in the text.\n",
+ "To start, we'll create a `.yaml` file specifying our pipeline."
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "pipeline = '''\n",
+ "pipeline:\n",
+ " # Read input data. Each line from the txt file is a String.\n",
+ " - type: ReadFromText\n",
+ " name: InputText\n",
+ " config:\n",
+ " file_pattern: data/kinglear.txt\n",
+ "\n",
+ " # Using a regex, we'll split the content of the message (one long
string) into words (list of strings).\n",
+ " # The 'fn' parameter accepts functions written in Python\n",
+ " - type: PyFlatMap\n",
+ " name: FindWords\n",
+ " input: InputText\n",
+ " config:\n",
+ " fn: |\n",
+ " import re\n",
+ " lambda line: re.findall(r\"[a-zA-Z]+\", line)\n",
+ "\n",
+ " # Transforming each word to lower case and combining it with a '1'.
Result of this step are pairs (word: 1).\n",
+ " - type: PyMap\n",
+ " name: PairWordsWith1\n",
+ " input: FindWords\n",
+ " config:\n",
+ " fn: 'lambda word: (word, 1)'\n",
+ "\n",
+ " # Using CombinePerKey transform with the 'sum' function as a combine
function,\n",
+ " # we'll calculate the occurrence of each word.\n",
+ " - type: CombinePerKey\n",
+ " config:\n",
+ " combine_fn: sum\n",
+ " name: GroupAndSum\n",
+ " input: PairWordsWith1\n",
+ "\n",
+ " # Format results - each record should be represented as 'word:
count'.\n",
+ " # The 'fn' parameter accepts functions written in Python\n",
+ " - type: PyMap\n",
+ " name: FormatResults\n",
+ " input: GroupAndSum\n",
+ " config:\n",
+ " fn: \"lambda word_count_tuple: f'{word_count_tuple[0]}:
{word_count_tuple[1]}'\"\n",
+ "\n",
+ " # Save results to a text file.\n",
+ " - type: WriteToText\n",
+ " name: SaveToText\n",
+ " input: FormatResults\n",
+ " config:\n",
+ " file_path_prefix: \"data/result-pipeline-01\"\n",
+ " file_name_suffix: \".txt\"\n",
+ "'''\n",
+ "save_to_file(pipeline, 'pipeline-01.yaml')"
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Each pipeline specification has to start with a `pipeline` key on the
first line.\n",
+ "Following, there is a list of transforms, such as the first one:\n",
+ "```\n",
+ " # Read input data. Each line from the csv file is a String.\n",
+ " - type: ReadFromText\n",
+ " name: InputText\n",
+ " config:\n",
+ " file_pattern: data/kinglear.txt\n",
+ "```\n",
+ "This one reads an input file for futher processing.\n",
+ "Remember that the indentation is important here - it specifies object
hierarchy.\n",
+ "YAML also supports comments - everything after the `#` is always treated
as a comment. Use them to improve readability.\n",
+ "\n",
+ "Each operation has to specify the `type` descriptor and other fields,
such as `name` and other transform-specific parameters.\n",
+ "Take a look at the documentation for a list of available transforms and
their parameters. # todo(yaml) add link\n",
+ "\n",
+ "To link two operations, use the `input` field, just like in the third
operation:\n",
+ "```\n",
+ " # Transforming each word to lower case and combining it with a '1'.
Result of this step are pairs (word: 1).\n",
+ " - type: PyMap\n",
+ " name: PairWordsWith1\n",
+ " input: FindWords\n",
+ " config:\n",
+ " fn: 'lambda word: (word, 1)'\n",
+ "```\n",
+ "The `input` refers to the name of another transform, the previous one in
this case.\n",
+ "This particular operation takes `fn` (stands for function) as an
argument. Currently only Python functions are supported.\n",
+ "For more complicated functions, you can take advantage of YAML's
multiline feature, as you can see in the second operation:\n",
+ "```\n",
+ " # Using a regex, we'll split the content of the message (one long
string) into words (list of strings).\n",
+ " - type: PyFlatMap\n",
+ " name: FindWords\n",
+ " input: InputText\n",
+ " config:\n",
+ " fn: |\n",
+ " import re\n",
+ " lambda line: re.findall(r\"[a-zA-Z]+\", line)\n",
+ "```\n",
+ "Here we had to import Python's regex package - `re`. To do that, we
indicated a multiline string with `|` character and wrote the function in 2
lines.\n",
+ "\n",
+ "Let's run the pipeline executing the Python entry-point script
(`apache_beam.yaml.main`) with our pipeline file as an argument:"
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "run('python -m apache_beam.yaml.main
--pipeline_spec_file=pipeline-01.yaml')"
Review Comment:
Hi @yardenbm,
there were some changes in the yaml api, which make this notebook outdated.
I'm updating it as we speak. It should be functional by the end of this week,
and merged soon after.
If you're looking for some examples on how to construct a pipeline, take a
look at the Readme.md
(https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml#readme)
and other .md files and unit tests in
https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]