[GitHub] [beam] davidcavazos commented on a change in pull request #14045: [BEAM-10937] Add reading and writing data notebook

GitBox Mon, 01 Mar 2021 13:38:50 -0800


davidcavazos commented on a change in pull request #14045:
URL: https://github.com/apache/beam/pull/14045#discussion_r585067892




##########
File path: examples/notebooks/tour-of-beam/reading-and-writing-data.ipynb
##########
@@ -0,0 +1,939 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "Reading and writing data -- Tour of Beam",
+      "provenance": [],
+      "collapsed_sections": [],
+      "toc_visible": true
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/tour-of-beam/reading-and-writing-data.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "cellView": "form",
+        "id": "upmJn_DjcThx"
+      },
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "execution_count": 95,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5UC_aGanx6oE"
+      },
+      "source": [
+        "# Reading and writing data -- _Tour of Beam_\n",
+        "\n",
+        "So far we've learned some of the basic transforms like\n",
+        "[`Map`](https://beam.apache.org/documentation/transforms/python/elementwise/map) _(one-to-one)_,\n",
+        "[`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap) _(one-to-many)_,\n",
+        "[`Filter`](https://beam.apache.org/documentation/transforms/python/elementwise/filter) _(one-to-zero-or-one)_,\n",
+        "[`Combine`](https://beam.apache.org/documentation/transforms/python/aggregation/combineglobally) _(many-to-one)_, and\n",
+        "[`GroupByKey`](https://beam.apache.org/documentation/transforms/python/aggregation/groupbykey).\n",
+        "These allow us to transform data in any way, but so far we've created data from an in-memory\n",
+        "[`iterable`](https://docs.python.org/3/glossary.html#term-iterable), like a `list`, using\n",
+        "[`Create`](https://beam.apache.org/documentation/transforms/python/other/create).\n",
+        "\n",
+        "This works well for experimenting with small datasets. For larger 
datasets we use a **`Source`** transform to read data and a **`Sink`** 
transform to write data.\n",
+        "\n",
+        "Let's create some data files and see how we can read them in Beam."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "R_Yhoc6N_Flg"
+      },
+      "source": [
+        "# Install apache-beam with pip.\n",
+        "!pip install --quiet apache-beam\n",
+        "\n",
+        "# Create a directory for our data files.\n",
+        "!mkdir -p data"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "sQUUi4H9s-g2"
+      },
+      "source": [
+        "%%writefile data/my-text-file-1.txt\n",
+        "This is just a plain text file, UTF-8 strings are allowed 🎉.\n",
+        "Each line in the file is one element in the PCollection."
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "BWVVeTSOlKug"
+      },
+      "source": [
+        "%%writefile data/my-text-file-2.txt\n",
+        "There are no guarantees on the order of the elements.\n",
+        "ฅ^•ﻌ•^ฅ"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "NhCws6ncbDJG"
+      },
+      "source": [
+        "%%writefile data/penguins.csv\n",
+        
"species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g\n",
+        
"0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667\n",
+        
"0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556\n",
+        
"1,0.5236363636363636,0.5714285714285713,0.3389830508474576,0.2222222222222222\n",
+        
"1,0.6509090909090909,0.7619047619047619,0.4067796610169492,0.3333333333333333\n",
+        "2,0.509090909090909,0.011904761904761862,0.6610169491525424,0.5\n",
+        
"2,0.6509090909090909,0.38095238095238104,0.9830508474576272,0.8333333333333334"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "_OkWHiAvpWDZ"
+      },
+      "source": [
+        "# Reading from text files\n",
+        "\n",
+        "We can use the\n",
+        "[`ReadFromText`](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText)\n",
+        "transform to read text files into `str` elements.\n",
+        "\n",
+        "It takes a\n",
+        "[_glob pattern_](https://en.wikipedia.org/wiki/Glob_%28programming%29)\n",
+        "as an input, and reads all the files that match that pattern.\n",
+        "It returns one element for each line in each matched file.\n",
+        "\n",
+        "For example, in the pattern `data/*.txt`, the `*` is a wildcard that matches anything. This pattern matches all the files in the `data/` directory with a `.txt` extension."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "xDXdE9uysriw",
+        "outputId": "f5d58b5d-892a-4a42-89c5-b78f1d329cf3"
+      },
+      "source": [
+        "import apache_beam as beam\n",
+        "\n",
+        "file_name = 'data/*.txt'\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  (\n",
+        "      pipeline\n",
+        "      | 'Read files' >> beam.io.ReadFromText(file_name)\n",
+        "      | 'Print contents' >> beam.Map(print)\n",
+        "  )"
+      ],
+      "execution_count": 96,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "There are no guarantees on the order of the elements.\n",
+            "ฅ^•ﻌ•^ฅ\n",
+            "This is just a plain text file, UTF-8 strings are allowed 🎉.\n",
+            "Each line in the file is one element in the PCollection.\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "9-2wmzEWsdrb"
+      },
+      "source": [
+        "# Writing to text files\n",
+        "\n",
+        "We can use the\n",
+        "[`WriteToText`](https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.io.textio.html#apache_beam.io.textio.WriteToText) transform to write `str` elements into text files.\n",
+        "\n",
+        "It takes a _file path prefix_ as an input, and it writes all the `str` elements into one or more files with filenames starting with that prefix. You can optionally pass a `file_name_suffix` as well, usually used for the file extension. Each element goes into its own line in the output files."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "nkPlfoTfz61I"
+      },
+      "source": [
+        "import apache_beam as beam\n",
+        "\n",
+        "file_name_prefix = 'outputs/file'\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  (\n",
+        "      pipeline\n",
+        "      | 'Create file lines' >> beam.Create([\n",
+        "          'Each element must be a string.',\n",
+        "          'It writes one element per line.',\n",
+        "          'There are no guarantees on the line order.',\n",
+        "          'The data might be written into multiple files.',\n",
+        "      ])\n",
+        "      | 'Write to files' >> beam.io.WriteToText(\n",
+        "          file_name_prefix,\n",
+        "          file_name_suffix='.txt')\n",
+        "  )"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "8au0yJSd1itt",
+        "outputId": "d7e72785-9fa8-4a2b-c6d0-4735aac8e206"
+      },
+      "source": [
+        "# Let's look at the output files and contents.\n",
+        "!head outputs/file*.txt"
+      ],
+      "execution_count": 98,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Each element must be a string.\n",
+            "It writes one element per line.\n",
+            "There are no guarantees on the line order.\n",
+            "The data might be written into multiple files.\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "21CCdZispqYK"
+      },
+      "source": [
+        "# Reading data\n",
+        "\n",
+        "Your data might reside in various input formats. Take a look at 
the\n",
+        "[Built-in I/O 
Transforms](https://beam.apache.org/documentation/io/built-in)\n",
+        "page for a list of all the available I/O transforms in Beam.\n",
+        "\n",
+        "If none of those work for you, you might need to create your own 
input transform.\n",
+        "\n",
+        "> ℹ️ For a more in-depth guide, take a look at the\n",
+        "[Developing a new I/O 
connector](https://beam.apache.org/documentation/io/developing-io-overview) 
page."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "7dQEym1QRG4y"
+      },
+      "source": [
+        "## Reading from an `iterable`\n",
+        "\n",
+        "The easiest way to create elements is using\n",
+        
"[`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap).\n",
+        "\n",
+        "A common way is to use a [`generator`](https://docs.python.org/3/glossary.html#term-generator) function. This could take an input and _expand_ it into a large number of elements. The nice thing about `generator`s is that they don't have to fit everything into memory like a `list`; they simply\n",
+        "[`yield`](https://docs.python.org/3/reference/simple_stmts.html#yield)\n",
+        "elements as they process them.\n",
+        "\n",
+        "For example, let's define a `generator` called `count`, that `yield`s the numbers from `0` to `n-1`. We use `Create` for the initial `n` value(s) and then expand them with `FlatMap`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "wR6WY6wOMVhb",
+        "outputId": "232e9fb3-4054-4eaf-9bd0-1adc4435b220"
+      },
+      "source": [
+        "import apache_beam as beam\n",
+        "\n",
+        "def count(n):\n",
+        "  for i in range(n):\n",
+        "    yield i\n",
+        "\n",
+        "n = 5\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  (\n",
+        "      pipeline\n",
+        "      | 'Create inputs' >> beam.Create([n])\n",
+        "      | 'Generate elements' >> beam.FlatMap(count)\n",
+        "      | 'Print elements' >> beam.Map(print)\n",
+        "  )"
+      ],
+      "execution_count": 8,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "0\n",
+            "1\n",
+            "2\n",
+            "3\n",
+            "4\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "G4fw7NE1RQNf"
+      },
+      "source": [
+        "## Creating an input transform\n",
+        "\n",
+        "For a nicer interface, we could abstract the `Create` and the 
`FlatMap` into a custom `PTransform`. This would give a more intuitive way to 
use it, while hiding the inner workings.\n",
+        "\n",
+        "We create a new class that inherits from `beam.PTransform`. Any input to the generator function, like `n`, becomes a class field. The generator function itself would now become a\n",
+        
"[`staticmethod`](https://docs.python.org/3/library/functions.html#staticmethod).\n",
+        "And we can hide the `Create` and `FlatMap` in the `expand` method.\n",
+        "\n",
+        "Now we can use our transform in a more intuitive way, just like 
`ReadFromText`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "m8iXqE1CRnn5",
+        "outputId": "019f3b32-74c5-4860-edee-1c8553f200bb"
+      },
+      "source": [
+        "import apache_beam as beam\n",
+        "\n",
+        "class Count(beam.PTransform):\n",
+        "  def __init__(self, n):\n",
+        "    self.n = n\n",
+        "\n",
+        "  @staticmethod\n",
+        "  def count(n):\n",
+        "    for i in range(n):\n",
+        "      yield i\n",
+        "\n",
+        "  def expand(self, pcollection):\n",
+        "    return (\n",
+        "        pcollection\n",
+        "        | 'Create inputs' >> beam.Create([self.n])\n",
+        "        | 'Generate elements' >> beam.FlatMap(Count.count)\n",
+        "    )\n",
+        "\n",
+        "n = 5\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  (\n",
+        "      pipeline\n",
+        "      | f'Count to {n}' >> Count(n)\n",
+        "      | 'Print elements' >> beam.Map(print)\n",
+        "  )"
+      ],
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "0\n",
+            "1\n",
+            "2\n",
+            "3\n",
+            "4\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "e02_vFmUg-mK"
+      },
+      "source": [
+        "## Example: Reading CSV files\n",
+        "\n",
+        "Let's say we want to read CSV files to get elements as `dict`s. We like how `ReadFromText` expands a file pattern, but we might want to allow for multiple patterns as well.\n",
+        "\n",
+        "We create a `ReadCsvFiles` transform, which takes a list of `file_patterns` as input. It expands all the `glob` patterns, and then, for each file name, it reads each row as a `dict` using the\n",
+        "[`csv.DictReader`](https://docs.python.org/3/library/csv.html#csv.DictReader) class."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "ywVbJxegaZbo",
+        "outputId": "5e0adfa3-e685-4fe0-b6b7-bfa3d8469da1"
+      },
+      "source": [
+        "import apache_beam as beam\n",
+        "import csv\n",
+        "import glob\n",
+        "\n",
+        "class ReadCsvFiles(beam.PTransform):\n",
+        "  def __init__(self, file_patterns):\n",
+        "    self.file_patterns = file_patterns\n",
+        "\n",
+        "  @staticmethod\n",
+        "  def read_csv_lines(file_name):\n",
+        "    with open(file_name, 'r') as f:\n",
+        "      for row in csv.DictReader(f):\n",
+        "        yield dict(row)\n",
+        "\n",
+        "  def expand(self, pcollection):\n",
+        "    return (\n",
+        "        pcollection\n",
+        "        | 'Create file patterns' >> beam.Create(self.file_patterns)\n",
+        "        | 'Expand file patterns' >> beam.FlatMap(glob.glob)\n",
+        "        | 'Read CSV lines' >> beam.FlatMap(self.read_csv_lines)\n",
+        "    )\n",
+        "\n",
+        "file_patterns = ['data/*.csv']\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  (\n",
+        "      pipeline\n",
+        "      | 'Read CSV files' >> ReadCsvFiles(file_patterns)\n",
+        "      | 'Print elements' >> beam.Map(print)\n",
+        "  )"
+      ],
+      "execution_count": 86,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "{'species': '0', 'culmen_length_mm': '0.2545454545454545', 
'culmen_depth_mm': '0.6666666666666666', 'flipper_length_mm': 
'0.15254237288135594', 'body_mass_g': '0.2916666666666667'}\n",
+            "{'species': '0', 'culmen_length_mm': '0.26909090909090905', 
'culmen_depth_mm': '0.5119047619047618', 'flipper_length_mm': 
'0.23728813559322035', 'body_mass_g': '0.3055555555555556'}\n",
+            "{'species': '1', 'culmen_length_mm': '0.5236363636363636', 
'culmen_depth_mm': '0.5714285714285713', 'flipper_length_mm': 
'0.3389830508474576', 'body_mass_g': '0.2222222222222222'}\n",
+            "{'species': '1', 'culmen_length_mm': '0.6509090909090909', 
'culmen_depth_mm': '0.7619047619047619', 'flipper_length_mm': 
'0.4067796610169492', 'body_mass_g': '0.3333333333333333'}\n",
+            "{'species': '2', 'culmen_length_mm': '0.509090909090909', 
'culmen_depth_mm': '0.011904761904761862', 'flipper_length_mm': 
'0.6610169491525424', 'body_mass_g': '0.5'}\n",
+            "{'species': '2', 'culmen_length_mm': '0.6509090909090909', 
'culmen_depth_mm': '0.38095238095238104', 'flipper_length_mm': 
'0.9830508474576272', 'body_mass_g': '0.8333333333333334'}\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ZyzB_RO9Vs1D"
+      },
+      "source": [
+        "## Example: Reading from a SQLite database\n",
+        "\n",
+        "Let's begin by creating a small local SQLite database file.\n",
+        "\n",
+        "Run the _\"Creating the SQLite database\"_ cell to create a new SQLite3 database with the filename you choose. You can double-click it to see the source code if you want."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "EJ58A0AoV02o",
+        "cellView": "form",
+        "outputId": "7025eb26-409d-4212-bd10-a3bccbb2679f"
+      },
+      "source": [
+        "#@title Creating the SQLite database\n",
+        "import sqlite3\n",
+        "\n",
+        "database_file = \"moon-phases.db\" #@param {type:\"string\"}\n",
+        "\n",
+        "with sqlite3.connect(database_file) as db:\n",
+        "  cursor = db.cursor()\n",
+        "\n",
+        "  # Create the moon_phases table.\n",
+        "  cursor.execute('''\n",
+        "    CREATE TABLE IF NOT EXISTS moon_phases (\n",
+        "      id INTEGER PRIMARY KEY,\n",
+        "      phase_emoji TEXT NOT NULL,\n",
+        "      peak_datetime DATETIME NOT NULL,\n",
+        "      phase TEXT NOT NULL)''')\n",
+        "\n",
+        "  # Truncate the table if it's already populated.\n",
+        "  cursor.execute('DELETE FROM moon_phases')\n",
+        "\n",
+        "  # Insert some sample data.\n",
+        "  insert_moon_phase = 'INSERT INTO moon_phases(phase_emoji, 
peak_datetime, phase) VALUES(?, ?, ?)'\n",
+        "  cursor.execute(insert_moon_phase, ('🌕', '2017-12-03 15:47:00', 
'Full Moon'))\n",
+        "  cursor.execute(insert_moon_phase, ('🌗', '2017-12-10 07:51:00', 
'Last Quarter'))\n",
+        "  cursor.execute(insert_moon_phase, ('🌑', '2017-12-18 06:30:00', 'New 
Moon'))\n",
+        "  cursor.execute(insert_moon_phase, ('🌓', '2017-12-26 09:20:00', 
'First Quarter'))\n",
+        "  cursor.execute(insert_moon_phase, ('🌕', '2018-01-02 02:24:00', 
'Full Moon'))\n",
+        "  cursor.execute(insert_moon_phase, ('🌗', '2018-01-08 22:25:00', 
'Last Quarter'))\n",
+        "  cursor.execute(insert_moon_phase, ('🌑', '2018-01-17 02:17:00', 'New 
Moon'))\n",
+        "  cursor.execute(insert_moon_phase, ('🌓', '2018-01-24 22:20:00', 
'First Quarter'))\n",
+        "  cursor.execute(insert_moon_phase, ('🌕', '2018-01-31 13:27:00', 
'Full Moon'))\n",
+        "\n",
+        "  # Query for the data in the table to make sure it's populated.\n",
+        "  cursor.execute('SELECT * FROM moon_phases')\n",
+        "  for row in cursor.fetchall():\n",
+        "    print(row)"
+      ],
+      "execution_count": 11,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "(1, '🌕', '2017-12-03 15:47:00', 'Full Moon')\n",
+            "(2, '🌗', '2017-12-10 07:51:00', 'Last Quarter')\n",
+            "(3, '🌑', '2017-12-18 06:30:00', 'New Moon')\n",
+            "(4, '🌓', '2017-12-26 09:20:00', 'First Quarter')\n",
+            "(5, '🌕', '2018-01-02 02:24:00', 'Full Moon')\n",
+            "(6, '🌗', '2018-01-08 22:25:00', 'Last Quarter')\n",
+            "(7, '🌑', '2018-01-17 02:17:00', 'New Moon')\n",
+            "(8, '🌓', '2018-01-24 22:20:00', 'First Quarter')\n",
+            "(9, '🌕', '2018-01-31 13:27:00', 'Full Moon')\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "8y-bRhPVWai6"
+      },
+      "source": [
+        "We could use a `FlatMap` transform to receive a SQL query and `yield` 
each result row, but that would mean creating a new database connection for 
each query. If we generated a large number of queries, creating that many 
connections could be a bottleneck.\n",
+        "\n",
+        "It would be nice to create the database connection only once for each 
worker, and every query could use the same connection if needed.\n",
+        "\n",
+        "We can use a\n",
+        "[custom `DoFn` 
transform](https://beam.apache.org/documentation/transforms/python/elementwise/pardo/#example-3-pardo-with-dofn-methods)\n",
+        "for this. It allows us to open and close resources, like the database 
connection, only _once_ per `DoFn` _instance_ by using the `setup` and 
`teardown` methods.\n",
+        "\n",
+        "> ℹ️ It should be safe to _read_ from a database with multiple 
concurrent processes using the same connection, but only one process should be 
_writing_ at once."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "Bnpwqr-NV5DF",
+        "outputId": "b3cb7e46-222b-4e82-8f41-81098f54b7ab"
+      },
+      "source": [
+        "import apache_beam as beam\n",
+        "import sqlite3\n",
+        "\n",
+        "class SQLiteSelect(beam.DoFn):\n",
+        "  def __init__(self, database_file):\n",
+        "    self.database_file = database_file\n",
+        "    self.connection = None\n",
+        "\n",
+        "  def setup(self):\n",
+        "    self.connection = sqlite3.connect(self.database_file)\n",
+        "\n",
+        "  def process(self, query):\n",
+        "    table, columns = query\n",
+        "    cursor = self.connection.cursor()\n",
+        "    cursor.execute(f\"SELECT {','.join(columns)} FROM {table}\")\n",
+        "    for row in cursor.fetchall():\n",
+        "      yield dict(zip(columns, row))\n",
+        "\n",
+        "  def teardown(self):\n",
+        "    self.connection.close()\n",
+        "\n",
+        "class SelectFromSQLite(beam.PTransform):\n",
+        "  def __init__(self, database_file, queries):\n",
+        "    self.database_file = database_file\n",
+        "    self.queries = queries\n",
+        "\n",
+        "  def expand(self, pcollection):\n",
+        "    return (\n",
+        "        pcollection\n",
+        "        | 'Create queries' >> beam.Create(self.queries)\n",
+        "        | 'SQLite SELECT' >> beam.ParDo(SQLiteSelect(self.database_file))\n",
+        "    )\n",
+        "\n",
+        "database_file = 'moon-phases.db'\n",
+        "queries = [\n",
+        "    # (table_name, [column1, column2, ...])\n",
+        "    ('moon_phases', ['phase_emoji', 'peak_datetime', 'phase']),\n",
+        "    ('moon_phases', ['phase_emoji', 'phase']),\n",
+        "]\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  (\n",
+        "      pipeline\n",
+        "      | 'Read from SQLite' >> SelectFromSQLite(database_file, 
queries)\n",
+        "      | 'Print rows' >> beam.Map(print)\n",
+        "  )"
+      ],
+      "execution_count": 12,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "{'phase_emoji': '🌕', 'peak_datetime': '2017-12-03 15:47:00', 
'phase': 'Full Moon'}\n",
+            "{'phase_emoji': '🌗', 'peak_datetime': '2017-12-10 07:51:00', 
'phase': 'Last Quarter'}\n",
+            "{'phase_emoji': '🌑', 'peak_datetime': '2017-12-18 06:30:00', 
'phase': 'New Moon'}\n",
+            "{'phase_emoji': '🌓', 'peak_datetime': '2017-12-26 09:20:00', 
'phase': 'First Quarter'}\n",
+            "{'phase_emoji': '🌕', 'peak_datetime': '2018-01-02 02:24:00', 
'phase': 'Full Moon'}\n",
+            "{'phase_emoji': '🌗', 'peak_datetime': '2018-01-08 22:25:00', 
'phase': 'Last Quarter'}\n",
+            "{'phase_emoji': '🌑', 'peak_datetime': '2018-01-17 02:17:00', 
'phase': 'New Moon'}\n",
+            "{'phase_emoji': '🌓', 'peak_datetime': '2018-01-24 22:20:00', 
'phase': 'First Quarter'}\n",
+            "{'phase_emoji': '🌕', 'peak_datetime': '2018-01-31 13:27:00', 
'phase': 'Full Moon'}\n",
+            "{'phase_emoji': '🌕', 'phase': 'Full Moon'}\n",
+            "{'phase_emoji': '🌗', 'phase': 'Last Quarter'}\n",
+            "{'phase_emoji': '🌑', 'phase': 'New Moon'}\n",
+            "{'phase_emoji': '🌓', 'phase': 'First Quarter'}\n",
+            "{'phase_emoji': '🌕', 'phase': 'Full Moon'}\n",
+            "{'phase_emoji': '🌗', 'phase': 'Last Quarter'}\n",
+            "{'phase_emoji': '🌑', 'phase': 'New Moon'}\n",
+            "{'phase_emoji': '🌓', 'phase': 'First Quarter'}\n",
+            "{'phase_emoji': '🌕', 'phase': 'Full Moon'}\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "C5Mx_pfNpu_q"
+      },
+      "source": [
+        "# Writing data\n",
+        "\n",
+        "You might want to write your data in various output formats. Take a look at the\n",
+        "[Built-in I/O 
Transforms](https://beam.apache.org/documentation/io/built-in)\n",
+        "page for a list of all the available I/O transforms in Beam.\n",
+        "\n",
+        "If none of those work for you, you might need to create your own 
output transform.\n",
+        "\n",
+        "> ℹ️ For a more in-depth guide, take a look at the\n",
+        "[Developing a new I/O 
connector](https://beam.apache.org/documentation/io/developing-io-overview) 
page."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "FpM368NEhc-q"
+      },
+      "source": [
+        "## Creating an output transform\n",
+        "\n",
+        "The most straightforward way to write data would be to use a `Map` 
transform to write each element into our desired output format. In most cases, 
however, this would result in a lot of overhead creating, connecting to, and/or 
deleting resources.\n",
+        "\n",
+        "Most data services are optimized to load _batches_ of elements at a time. A batched write only has to connect to the service once, and it can then load many elements at a time.\n",

Review comment:
       Sure, that makes sense. Thanks!
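
The batching idea in the last lines of the hunk can be illustrated with a minimal plain-Python sketch, outside of Beam. It opens one SQLite connection and writes rows in groups instead of one at a time. The `batches` and `write_in_batches` helpers, the two-column table layout, and the batch size are assumptions made up for this illustration; they are not taken from the notebook.

```python
import itertools
import sqlite3


def batches(elements, batch_size):
  """Yields lists of at most `batch_size` elements from any iterable."""
  iterator = iter(elements)
  while True:
    batch = list(itertools.islice(iterator, batch_size))
    if not batch:
      return
    yield batch


def write_in_batches(connection, rows, batch_size=2):
  """Writes rows with one INSERT call and one commit per batch.

  Returns the number of batches written.
  """
  # Hypothetical table layout, chosen only for this sketch.
  connection.execute(
      'CREATE TABLE IF NOT EXISTS moon_phases (peak_datetime TEXT, phase TEXT)')
  num_batches = 0
  for batch in batches(rows, batch_size):
    connection.executemany(
        'INSERT INTO moon_phases(peak_datetime, phase) VALUES(?, ?)', batch)
    connection.commit()
    num_batches += 1
  return num_batches


rows = [
    ('2017-12-03 15:47:00', 'Full Moon'),
    ('2017-12-10 07:51:00', 'Last Quarter'),
    ('2017-12-18 06:30:00', 'New Moon'),
]
# Connect once and reuse the same connection for every batch.
connection = sqlite3.connect(':memory:')
num_batches = write_in_batches(connection, rows)
num_rows = connection.execute('SELECT COUNT(*) FROM moon_phases').fetchone()[0]
print(num_batches, num_rows)
connection.close()
```

In a pipeline, a transform such as `beam.BatchElements` could play the role of the `batches` helper, with the connection opened in a `DoFn`'s `setup` method; that pairing is a suggestion here, not something shown in the hunk above.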



