rosetn commented on a change in pull request #14045:
URL: https://github.com/apache/beam/pull/14045#discussion_r586082067
##########
File path: examples/notebooks/tour-of-beam/reading-and-writing-data.ipynb
##########
@@ -0,0 +1,939 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Reading and writing data -- Tour of Beam",
Review comment:
General note about this notebook: consider using text formatting more
sparingly.
* If you're using italics to define terms, don't italicize words that you're
not trying to define or explain. I'm actually a little confused about why some
terms are italicized; it's possible that you're trying to make a new user
aware of important terms. However, pulling these terms out into a bulleted
list to highlight them, or naming your headers in a way that directs attention
to the term, might be a better choice.
* Similarly for bolded terms: I'm not sure why `Source` and `Sink` are
bolded.
* Here are some general guidelines:
https://developers.google.com/style/text-formatting. We don't need to follow
them strictly, but we should be deliberate in our choices.
##########
File path: examples/notebooks/tour-of-beam/reading-and-writing-data.ipynb
##########
@@ -0,0 +1,939 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Reading and writing data -- Tour of Beam",
+ "provenance": [],
+ "collapsed_sections": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/tour-of-beam/reading-and-writing-data.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "cellView": "form",
+ "id": "upmJn_DjcThx"
+ },
+ "source": [
+ "#@title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License."
+ ],
+ "execution_count": 95,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5UC_aGanx6oE"
+ },
+ "source": [
+ "# Reading and writing data -- _Tour of Beam_\n",
+ "\n",
+ "So far we've learned some of the basic transforms like\n",
+
"[`Map`](https://beam.apache.org/documentation/transforms/python/elementwise/map)
_(one-to-one)_,\n",
+
"[`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap)
_(one-to-many)_,\n",
+
"[`Filter`](https://beam.apache.org/documentation/transforms/python/elementwise/filter)
_(one-to-zero)_,\n",
+
"[`Combine`](https://beam.apache.org/documentation/transforms/python/aggregation/combineglobally)
_(many-to-one)_, and\n",
+
"[`GroupByKey`](https://beam.apache.org/documentation/transforms/python/aggregation/groupbykey).\n",
+ "These allow us to transform data in any way, but so far we've created
data from an in-memory\n",
+ "[`iterable`](https://docs.python.org/3/glossary.html#term-iterable),
like a `List`, using\n",
+
"[`Create`](https://beam.apache.org/documentation/transforms/python/other/create).\n",
+ "\n",
+ "This works well for experimenting with small datasets. For larger
datasets we use a **`Source`** transform to read data and a **`Sink`**
transform to write data.\n",
+ "\n",
+ "Let's create some data files and see how we can read them in Beam."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "R_Yhoc6N_Flg"
+ },
+ "source": [
+ "# Install apache-beam with pip.\n",
+ "!pip install --quiet apache-beam\n",
+ "\n",
+ "# Create a directory for our data files.\n",
+ "!mkdir -p data"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sQUUi4H9s-g2"
+ },
+ "source": [
+ "%%writefile data/my-text-file-1.txt\n",
+ "This is just a plain text file, UTF-8 strings are allowed 🎉.\n",
+ "Each line in the file is one element in the PCollection."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BWVVeTSOlKug"
+ },
+ "source": [
+ "%%writefile data/my-text-file-2.txt\n",
+ "There are no guarantees on the order of the elements.\n",
+ "ฅ^•ﻌ•^ฅ"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NhCws6ncbDJG"
+ },
+ "source": [
+ "%%writefile data/penguins.csv\n",
+
"species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g\n",
+
"0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667\n",
+
"0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556\n",
+
"1,0.5236363636363636,0.5714285714285713,0.3389830508474576,0.2222222222222222\n",
+
"1,0.6509090909090909,0.7619047619047619,0.4067796610169492,0.3333333333333333\n",
+ "2,0.509090909090909,0.011904761904761862,0.6610169491525424,0.5\n",
+
"2,0.6509090909090909,0.38095238095238104,0.9830508474576272,0.8333333333333334"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_OkWHiAvpWDZ"
+ },
+ "source": [
+ "# Reading from text files\n",
+ "\n",
+ "We can use the\n",
+
"[`ReadFromText`](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText)\n",
+ "transform to read text files into `str` elements.\n",
+ "\n",
+ "It takes a\n",
+ "[_glob
pattern_](https://en.wikipedia.org/wiki/Glob_%28programming%29)\n",
+ "as an input, and reads all the files that match that pattern.\n",
+ "It returns one element for each line in the file.\n",
+ "\n",
+ "For example, in the pattern `data/*.txt`, the `*` is a wildcard that
matches anything. This pattern matches all the files in the `data/` directory
with a `.txt` extension."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "xDXdE9uysriw",
+ "outputId": "f5d58b5d-892a-4a42-89c5-b78f1d329cf3"
+ },
+ "source": [
+ "import apache_beam as beam\n",
+ "\n",
+ "input_files = 'data/*.txt'\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " (\n",
+ " pipeline\n",
+ " | 'Read files' >> beam.io.ReadFromText(input_files)\n",
+ " | 'Print contents' >> beam.Map(print)\n",
+ " )"
+ ],
+ "execution_count": 96,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "There are no guarantees on the order of the elements.\n",
+ "ฅ^•ﻌ•^ฅ\n",
+ "This is just a plain text file, UTF-8 strings are allowed 🎉.\n",
+ "Each line in the file is one element in the PCollection.\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9-2wmzEWsdrb"
+ },
+ "source": [
+ "# Writing to text files\n",
+ "\n",
+ "We can use the\n",
+
"[`WriteToText`](https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.io.textio.html#apache_beam.io.textio.WriteToText)
transform to write `str` elements into text files.\n",
+ "\n",
+ "It takes a _file path prefix_ as an input, and it writes the all
`str` elements into one or more files with filenames starting with that prefix.
You can optionally pass a `file_name_suffix` as well, usually used for the file
extension. Each element goes into its own line in the output files."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nkPlfoTfz61I"
+ },
+ "source": [
+ "import apache_beam as beam\n",
+ "\n",
+ "output_file_name_prefix = 'outputs/file'\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " (\n",
+ " pipeline\n",
+ " | 'Create file lines' >> beam.Create([\n",
+ " 'Each element must be a string.',\n",
+ " 'It writes one element per line.',\n",
+ " 'There are no guarantees on the line order.',\n",
+ " 'The data might be written into multiple files.',\n",
+ " ])\n",
+ " | 'Write to files' >> beam.io.WriteToText(\n",
+ " output_file_name_prefix,\n",
+ " file_name_suffix='.txt')\n",
+ " )"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "8au0yJSd1itt",
+ "outputId": "d7e72785-9fa8-4a2b-c6d0-4735aac8e206"
+ },
+ "source": [
+ "# Lets look at the output files and contents.\n",
+ "!head outputs/file*.txt"
+ ],
+ "execution_count": 98,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Each element must be a string.\n",
+ "It writes one element per line.\n",
+ "There are no guarantees on the line order.\n",
+ "The data might be written into multiple files.\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "21CCdZispqYK"
+ },
+ "source": [
+ "# Reading data\n",
+ "\n",
+ "Your data might reside in various input formats. Take a look at
the\n",
+ "[Built-in I/O
Transforms](https://beam.apache.org/documentation/io/built-in)\n",
+ "page for a list of all the available I/O transforms in Beam.\n",
+ "\n",
+ "If none of those work for you, you might need to create your own
input transform.\n",
+ "\n",
+ "> ℹ️ For a more in-depth guide, take a look at the\n",
+ "[Developing a new I/O
connector](https://beam.apache.org/documentation/io/developing-io-overview)
page."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7dQEym1QRG4y"
+ },
+ "source": [
+ "## Reading from an `iterable`\n",
+ "\n",
+ "The easiest way to create elements is using\n",
+
"[`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap).\n",
+ "\n",
+ "A common way is having a
[`generator`](https://docs.python.org/3/glossary.html#term-generator) function.
This could take an input and _expand_ it into a large amount of elements. The
nice thing about `generator`s is that they don't have to fit everything into
memory like a `list`, they simply\n",
+
"[`yield`](https://docs.python.org/3/reference/simple_stmts.html#yield)\n",
+ "elements as they process them.\n",
+ "\n",
+ "For example, let's define a `generator` called `count`, that `yield`s
the numbers from `0` to `n`. We use `Create` for the initial `n` value(s) and
then exapand them with `FlatMap`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "wR6WY6wOMVhb",
+ "outputId": "232e9fb3-4054-4eaf-9bd0-1adc4435b220"
+ },
+ "source": [
+ "import apache_beam as beam\n",
+ "\n",
+ "def count(n):\n",
+ " for i in range(n):\n",
+ " yield i\n",
+ "\n",
+ "n = 5\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " (\n",
+ " pipeline\n",
+ " | 'Create inputs' >> beam.Create([n])\n",
+ " | 'Generate elements' >> beam.FlatMap(count)\n",
+ " | 'Print elements' >> beam.Map(print)\n",
+ " )"
+ ],
+ "execution_count": 8,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "0\n",
+ "1\n",
+ "2\n",
+ "3\n",
+ "4\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "G4fw7NE1RQNf"
+ },
+ "source": [
+ "## Creating an input transform\n",
+ "\n",
+ "For a nicer interface, we could abstract the `Create` and the
`FlatMap` into a custom `PTransform`. This would give a more intuitive way to
use it, while hiding the inner workings.\n",
+ "\n",
+ "We create a new class that inherits from `beam.PTransform`. Any input
from the generator function, like `n`, becomes a class field. The generator
function itself would now become a\n",
+
"[`staticmethod`](https://docs.python.org/3/library/functions.html#staticmethod).\n",
+ "And we can hide the `Create` and `FlatMap` in the `expand` method.\n",
+ "\n",
+ "Now we can use our transform in a more intuitive way, just like
`ReadFromText`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "m8iXqE1CRnn5",
+ "outputId": "019f3b32-74c5-4860-edee-1c8553f200bb"
+ },
+ "source": [
+ "import apache_beam as beam\n",
+ "\n",
+ "class Count(beam.PTransform):\n",
+ " def __init__(self, n):\n",
+ " self.n = n\n",
+ "\n",
+ " @staticmethod\n",
+ " def count(n):\n",
+ " for i in range(n):\n",
+ " yield i\n",
+ "\n",
+ " def expand(self, pcollection):\n",
+ " return (\n",
+ " pcollection\n",
+ " | 'Create inputs' >> beam.Create([self.n])\n",
+ " | 'Generate elements' >> beam.FlatMap(Count.count)\n",
+ " )\n",
+ "\n",
+ "n = 5\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " (\n",
+ " pipeline\n",
+ " | f'Count to {n}' >> Count(n)\n",
+ " | 'Print elements' >> beam.Map(print)\n",
+ " )"
+ ],
+ "execution_count": 9,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "0\n",
+ "1\n",
+ "2\n",
+ "3\n",
+ "4\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "e02_vFmUg-mK"
+ },
+ "source": [
+ "## Example: Reading CSV files\n",
+ "\n",
+ "Lets say we want to read CSV files get elements as `dict`s. We like
how `ReadFromText` expands a file pattern, but we might want to allow for
multiple patterns as well.\n",
Review comment:
I think you're missing a "to" in this sentence: "read CSV files to get
elements".
##########
File path: website/www/site/content/en/get-started/tour-of-beam.md
##########
@@ -30,9 +30,18 @@ You can also [try an Apache Beam
pipeline](/get-started/try-apache-beam) using t
### Learn the basics
In this notebook we go through the basics of what is Apache Beam and how to
get started.
+We learn what is a data _pipeline_, a _PCollection_, a _PTransform_, as well
as some basic transforms like `Map`, `FlatMap`, `Filter`, `Combine`, and
`GroupByKey`.
{{< button-colab
url="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/tour-of-beam/getting-started.ipynb"
>}}
+### Reading and writing data
+
+Here we go through some examples on how to read and write data to and from
different data formats.
Review comment:
I'd replace "Here" with "In this notebook".
##########
File path: website/www/site/content/en/get-started/tour-of-beam.md
##########
@@ -30,9 +30,18 @@ You can also [try an Apache Beam
pipeline](/get-started/try-apache-beam) using t
### Learn the basics
In this notebook we go through the basics of what is Apache Beam and how to
get started.
+We learn what is a data _pipeline_, a _PCollection_, a _PTransform_, as well
as some basic transforms like `Map`, `FlatMap`, `Filter`, `Combine`, and
`GroupByKey`.
Review comment:
Suggested rewording: "We learn about data pipelines, PCollections, and
PTransforms, as well as some basic transforms like `Map`, `FlatMap`,
`Filter`, `Combine`, and `GroupByKey`."
Alternatively, keep the terms in italics.
##########
File path: examples/notebooks/tour-of-beam/reading-and-writing-data.ipynb
##########
@@ -0,0 +1,939 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Reading and writing data -- Tour of Beam",
+ "provenance": [],
+ "collapsed_sections": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/tour-of-beam/reading-and-writing-data.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "cellView": "form",
+ "id": "upmJn_DjcThx"
+ },
+ "source": [
+ "#@title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License."
+ ],
+ "execution_count": 95,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5UC_aGanx6oE"
+ },
+ "source": [
+ "# Reading and writing data -- _Tour of Beam_\n",
+ "\n",
+ "So far we've learned some of the basic transforms like\n",
+
"[`Map`](https://beam.apache.org/documentation/transforms/python/elementwise/map)
_(one-to-one)_,\n",
+
"[`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap)
_(one-to-many)_,\n",
+
"[`Filter`](https://beam.apache.org/documentation/transforms/python/elementwise/filter)
_(one-to-zero)_,\n",
+
"[`Combine`](https://beam.apache.org/documentation/transforms/python/aggregation/combineglobally)
_(many-to-one)_, and\n",
+
"[`GroupByKey`](https://beam.apache.org/documentation/transforms/python/aggregation/groupbykey).\n",
+ "These allow us to transform data in any way, but so far we've created
data from an in-memory\n",
+ "[`iterable`](https://docs.python.org/3/glossary.html#term-iterable),
like a `List`, using\n",
+
"[`Create`](https://beam.apache.org/documentation/transforms/python/other/create).\n",
+ "\n",
+ "This works well for experimenting with small datasets. For larger
datasets we use a **`Source`** transform to read data and a **`Sink`**
transform to write data.\n",
+ "\n",
+ "Let's create some data files and see how we can read them in Beam."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "R_Yhoc6N_Flg"
+ },
+ "source": [
+ "# Install apache-beam with pip.\n",
+ "!pip install --quiet apache-beam\n",
+ "\n",
+ "# Create a directory for our data files.\n",
+ "!mkdir -p data"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "sQUUi4H9s-g2"
+ },
+ "source": [
+ "%%writefile data/my-text-file-1.txt\n",
+ "This is just a plain text file, UTF-8 strings are allowed 🎉.\n",
+ "Each line in the file is one element in the PCollection."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "BWVVeTSOlKug"
+ },
+ "source": [
+ "%%writefile data/my-text-file-2.txt\n",
+ "There are no guarantees on the order of the elements.\n",
+ "ฅ^•ﻌ•^ฅ"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "NhCws6ncbDJG"
+ },
+ "source": [
+ "%%writefile data/penguins.csv\n",
+
"species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g\n",
+
"0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667\n",
+
"0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556\n",
+
"1,0.5236363636363636,0.5714285714285713,0.3389830508474576,0.2222222222222222\n",
+
"1,0.6509090909090909,0.7619047619047619,0.4067796610169492,0.3333333333333333\n",
+ "2,0.509090909090909,0.011904761904761862,0.6610169491525424,0.5\n",
+
"2,0.6509090909090909,0.38095238095238104,0.9830508474576272,0.8333333333333334"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_OkWHiAvpWDZ"
+ },
+ "source": [
+ "# Reading from text files\n",
+ "\n",
+ "We can use the\n",
+
"[`ReadFromText`](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText)\n",
+ "transform to read text files into `str` elements.\n",
+ "\n",
+ "It takes a\n",
+ "[_glob
pattern_](https://en.wikipedia.org/wiki/Glob_%28programming%29)\n",
+ "as an input, and reads all the files that match that pattern.\n",
+ "It returns one element for each line in the file.\n",
+ "\n",
+ "For example, in the pattern `data/*.txt`, the `*` is a wildcard that
matches anything. This pattern matches all the files in the `data/` directory
with a `.txt` extension."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "xDXdE9uysriw",
+ "outputId": "f5d58b5d-892a-4a42-89c5-b78f1d329cf3"
+ },
+ "source": [
+ "import apache_beam as beam\n",
+ "\n",
+ "input_files = 'data/*.txt'\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " (\n",
+ " pipeline\n",
+ " | 'Read files' >> beam.io.ReadFromText(input_files)\n",
+ " | 'Print contents' >> beam.Map(print)\n",
+ " )"
+ ],
+ "execution_count": 96,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "There are no guarantees on the order of the elements.\n",
+ "ฅ^•ﻌ•^ฅ\n",
+ "This is just a plain text file, UTF-8 strings are allowed 🎉.\n",
+ "Each line in the file is one element in the PCollection.\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9-2wmzEWsdrb"
+ },
+ "source": [
+ "# Writing to text files\n",
+ "\n",
+ "We can use the\n",
+
"[`WriteToText`](https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.io.textio.html#apache_beam.io.textio.WriteToText)
transform to write `str` elements into text files.\n",
+ "\n",
+ "It takes a _file path prefix_ as an input, and it writes the all
`str` elements into one or more files with filenames starting with that prefix.
You can optionally pass a `file_name_suffix` as well, usually used for the file
extension. Each element goes into its own line in the output files."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "nkPlfoTfz61I"
+ },
+ "source": [
+ "import apache_beam as beam\n",
+ "\n",
+ "output_file_name_prefix = 'outputs/file'\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " (\n",
+ " pipeline\n",
+ " | 'Create file lines' >> beam.Create([\n",
+ " 'Each element must be a string.',\n",
+ " 'It writes one element per line.',\n",
+ " 'There are no guarantees on the line order.',\n",
+ " 'The data might be written into multiple files.',\n",
+ " ])\n",
+ " | 'Write to files' >> beam.io.WriteToText(\n",
+ " output_file_name_prefix,\n",
+ " file_name_suffix='.txt')\n",
+ " )"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "8au0yJSd1itt",
+ "outputId": "d7e72785-9fa8-4a2b-c6d0-4735aac8e206"
+ },
+ "source": [
+ "# Lets look at the output files and contents.\n",
+ "!head outputs/file*.txt"
+ ],
+ "execution_count": 98,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "Each element must be a string.\n",
+ "It writes one element per line.\n",
+ "There are no guarantees on the line order.\n",
+ "The data might be written into multiple files.\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "21CCdZispqYK"
+ },
+ "source": [
+ "# Reading data\n",
+ "\n",
+ "Your data might reside in various input formats. Take a look at
the\n",
+ "[Built-in I/O
Transforms](https://beam.apache.org/documentation/io/built-in)\n",
+ "page for a list of all the available I/O transforms in Beam.\n",
+ "\n",
+ "If none of those work for you, you might need to create your own
input transform.\n",
+ "\n",
+ "> ℹ️ For a more in-depth guide, take a look at the\n",
+ "[Developing a new I/O
connector](https://beam.apache.org/documentation/io/developing-io-overview)
page."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7dQEym1QRG4y"
+ },
+ "source": [
+ "## Reading from an `iterable`\n",
+ "\n",
+ "The easiest way to create elements is using\n",
+
"[`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap).\n",
+ "\n",
+ "A common way is having a
[`generator`](https://docs.python.org/3/glossary.html#term-generator) function.
This could take an input and _expand_ it into a large amount of elements. The
nice thing about `generator`s is that they don't have to fit everything into
memory like a `list`, they simply\n",
+
"[`yield`](https://docs.python.org/3/reference/simple_stmts.html#yield)\n",
+ "elements as they process them.\n",
+ "\n",
+ "For example, let's define a `generator` called `count`, that `yield`s
the numbers from `0` to `n`. We use `Create` for the initial `n` value(s) and
then exapand them with `FlatMap`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "wR6WY6wOMVhb",
+ "outputId": "232e9fb3-4054-4eaf-9bd0-1adc4435b220"
+ },
+ "source": [
+ "import apache_beam as beam\n",
+ "\n",
+ "def count(n):\n",
+ " for i in range(n):\n",
+ " yield i\n",
+ "\n",
+ "n = 5\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " (\n",
+ " pipeline\n",
+ " | 'Create inputs' >> beam.Create([n])\n",
+ " | 'Generate elements' >> beam.FlatMap(count)\n",
+ " | 'Print elements' >> beam.Map(print)\n",
+ " )"
+ ],
+ "execution_count": 8,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "0\n",
+ "1\n",
+ "2\n",
+ "3\n",
+ "4\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "G4fw7NE1RQNf"
+ },
+ "source": [
+ "## Creating an input transform\n",
+ "\n",
+ "For a nicer interface, we could abstract the `Create` and the
`FlatMap` into a custom `PTransform`. This would give a more intuitive way to
use it, while hiding the inner workings.\n",
+ "\n",
+ "We create a new class that inherits from `beam.PTransform`. Any input
from the generator function, like `n`, becomes a class field. The generator
function itself would now become a\n",
+
"[`staticmethod`](https://docs.python.org/3/library/functions.html#staticmethod).\n",
+ "And we can hide the `Create` and `FlatMap` in the `expand` method.\n",
+ "\n",
+ "Now we can use our transform in a more intuitive way, just like
`ReadFromText`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "m8iXqE1CRnn5",
+ "outputId": "019f3b32-74c5-4860-edee-1c8553f200bb"
+ },
+ "source": [
+ "import apache_beam as beam\n",
+ "\n",
+ "class Count(beam.PTransform):\n",
+ " def __init__(self, n):\n",
+ " self.n = n\n",
+ "\n",
+ " @staticmethod\n",
+ " def count(n):\n",
+ " for i in range(n):\n",
+ " yield i\n",
+ "\n",
+ " def expand(self, pcollection):\n",
+ " return (\n",
+ " pcollection\n",
+ " | 'Create inputs' >> beam.Create([self.n])\n",
+ " | 'Generate elements' >> beam.FlatMap(Count.count)\n",
+ " )\n",
+ "\n",
+ "n = 5\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " (\n",
+ " pipeline\n",
+ " | f'Count to {n}' >> Count(n)\n",
+ " | 'Print elements' >> beam.Map(print)\n",
+ " )"
+ ],
+ "execution_count": 9,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "0\n",
+ "1\n",
+ "2\n",
+ "3\n",
+ "4\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "e02_vFmUg-mK"
+ },
+ "source": [
+ "## Example: Reading CSV files\n",
+ "\n",
+ "Lets say we want to read CSV files get elements as `dict`s. We like
how `ReadFromText` expands a file pattern, but we might want to allow for
multiple patterns as well.\n",
Review comment:
Replace `dict`s with "dictionary objects", "`dict` objects", or "Python
dictionaries".
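
For readers of this thread, here is a minimal sketch of what the sentence
under review describes: reading CSV rows as `dict` objects from every file
that matches one or more glob patterns. It uses only the Python standard
library rather than Beam, and the helper name `read_csv_rows` and the demo
file contents are hypothetical, not taken from the notebook.

```python
import csv
import glob
import os
import tempfile


def read_csv_rows(file_patterns):
    """Yield each CSV row as a dict, for every file matching any pattern.

    Hypothetical helper for illustration only; it is not part of the notebook.
    """
    for pattern in file_patterns:
        # Sort the matches so the file order is deterministic.
        for file_name in sorted(glob.glob(pattern)):
            with open(file_name, newline='') as f:
                # DictReader uses the header line as the dictionary keys.
                for row in csv.DictReader(f):
                    yield dict(row)


# Small self-contained demo using a throwaway directory and file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'penguins.csv')
    with open(path, 'w', newline='') as f:
        f.write('species,body_mass_g\n0,0.29\n1,0.22\n')
    rows = list(read_csv_rows([os.path.join(tmp, '*.csv')]))

print(rows)
```

Note that every value comes back as a `str`; converting numeric columns would
be a separate step.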