[GitHub] [beam] liferoad commented on a diff in pull request #26185: add one windowing example

via GitHub Mon, 24 Apr 2023 14:32:48 -0700


liferoad commented on code in PR #26185:
URL: https://github.com/apache/beam/pull/26185#discussion_r1175801699



##########
examples/notebooks/get-started/learn_beam_windowing_by_doing.ipynb:
##########
@@ -0,0 +1,1880 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "view-in-github"
+      },
+      "source": [
+        "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/learn_beam_windowing_by_doing.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "cellView": "form",
+        "id": "L7ZbRufePd2g"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        " # **Introduction to Windowing in Apache Beam**\n",
+        "\n",
+        "In this notebook, we will learn the fundamentals of **batch and 
stream data processing** as we walk through a few introductory examples in 
Beam. The pipelines in these examples process real-world data for air quality 
levels in India between 2017 and 2022.\n",
+        "\n",
+        "After this tutorial you should have a basic understanding of the 
following:\n",
+        "\n",
+        "*   What is **batch vs. stream** data processing?\n",
+        "*   How can I use Beam to run a **simple batch analysis job**?\n",
+        "*   How can I use Beam's **windowing features** to process only 
certain intervals of data at a time?"
+      ],
+      "metadata": {
+        "id": "83TJhNxLD7-W"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To begin, run the following cell to set up Apache Beam."
+      ],
+      "metadata": {
+        "id": "Dj3ftRRqfumW"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Install apache-beam.\n",
+        "!pip install --quiet apache-beam"
+      ],
+      "metadata": {
+        "id": "zmJ0pCmSvD0-",
+        "colab": {
+          "base_uri": "https://localhost:8080/";
+        },
+        "outputId": "9041f637-12a0-4f78-f60b-ebd3c3a1c186"
+      },
+      "execution_count": 1,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "\u001b[2K     
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.5/14.5 
MB\u001b[0m \u001b[31m53.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m 
\u001b[32m144.1/144.1 kB\u001b[0m \u001b[31m1.1 MB/s\u001b[0m eta 
\u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m89.7/89.7 
kB\u001b[0m \u001b[31m7.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[?25h  Preparing metadata (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n",
+            "\u001b[2K     
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m515.5/515.5 
kB\u001b[0m \u001b[31m12.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.6/2.6 
MB\u001b[0m \u001b[31m22.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m152.0/152.0 
kB\u001b[0m \u001b[31m10.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[?25h  Preparing metadata (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n",
+            "\u001b[2K     
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.7/2.7 
MB\u001b[0m \u001b[31m15.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[?25h  Preparing metadata (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n",
+            "  Building wheel for crcmod (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n",
+            "  Building wheel for dill (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n",
+            "  Building wheel for docopt (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Set the logging level to reduce verbose information\n",
+        "import logging\n",
+        "\n",
+        "logging.root.setLevel(logging.ERROR)"
+      ],
+      "metadata": {
+        "id": "7sBoLahzPlJ1"
+      },
+      "execution_count": 2,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "<hr style=\"border: 5px solid #003262;\" />\n",
+        "<hr style=\"border: 1px solid #fdb515;\" />\n",
+        "\n",
+        "## Batch vs. Stream Data Processing\n",
+        "\n",
+        "What's the difference?\n",
+        "\n",
+        "**Batch processing** is when data processing and analysis happens on 
a set of data that have already been stored over a period of time. In other 
words, the input is a finite, bounded data set. An example is payroll and 
billing systems that have to be processed weekly or monthly.\n",
+        "\n",
+        "**Stream processing** happens *as* data flows through a system. This 
results in analysis and reporting of events within a short period of time or 
near real-time on an infinite, unbounded data set. An example would be fraud 
detection or intrusion detection.\n",
+        "\n",
+        "> This tutorial will focus on **batch processing** examples. To learn 
more about stream processing in Beam, check out 
[this](https://beam.apache.org/documentation/sdks/python-streaming/)."
+      ],
+      "metadata": {
+        "id": "BB6FAwYj1dAi"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "<hr style=\"border: 5px solid #003262;\" />\n",
+        "\n",
+        "## Load the Data\n",
+        "\n",
+        "Let's import the example data we will be using throughout this 
tutorial. The 
[dataset](https://www.kaggle.com/datasets/fedesoriano/air-quality-data-in-india)
 consists of **hourly air quality data (PM 2.5) in India from November 2017 to 
June 2022**.\n",
+        "\n",
+        "> The World Health Organization (WHO) reports 7 million premature 
deaths linked to air pollution each year. In India alone, more than 90% of the 
country's population live in areas where air quality is below the WHO's 
standards.\n",
+        "\n",
+        "**What does the data look like?**\n",
+        "\n",
+        "The data set has 36,192 rows and 6 columns in total recording the 
following attributes:\n",
+        "\n",
+        "1.   `Timestamp`: Timestamp in the format YYYY-MM-DD HH:MM:SS\n",
+        "2.   `Year`: Year of the measure\n",
+        "3.   `Month`: Month of the measure\n",
+        "4.   `Day`: Day of the measure\n",
+        "5.   `Hour`: Hour of the measure\n",
+        "6.   `PM2.5`: Fine particulate matter air pollutant level in air\n",
+        "\n",
+        "**For our purposes, we will focus on only the first and last column 
of the** `air_quality` **DataFrame**:\n",
+        "\n",
+        "1.   `Timestamp`: Timestamp in the format YYYY-MM-DD HH:MM:SS\n",
+        "2.   `PM 2.5`: Fine particulate matter air pollutant level in air\n",
+        "\n",
+        "Run the following cell to load the data into our file directory."
+      ],
+      "metadata": {
+        "id": "W_63UtsoBRql"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "id": "GTteBUZ-7e2s",
+        "colab": {
+          "base_uri": "https://localhost:8080/";
+        },
+        "outputId": "3af9cdb0-c248-4c6d-96f6-c3739fb66014"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Copying gs://batch-processing-example/air-quality-india.csv...\n",
+            "/ [1 files][  1.4 MiB/  1.4 MiB]                                  
              \n",
+            "Operation completed over 1 objects/1.4 MiB.                       
               \n"
+          ]
+        }
+      ],
+      "source": [
+        "# Copy the dataset file into the local file system from Google Cloud 
Storage.\n",
+        "!mkdir -p data\n",
+        "!gsutil cp gs://batch-processing-example/air-quality-india.csv data/"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "#### Data Preparation\n",
+        "\n",
+        "Before we load the data into a Beam pipeline, let's use Pandas to 
select two columns."
+      ],
+      "metadata": {
+        "id": "1NcmPl7C43lY"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Load the data into a Python Pandas DataFrame.\n",

Review Comment:
   well, Beam Dataframe already has a good notebook: 
https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/tour-of-beam/dataframes.ipynb.
 I am not sure whether it covers all the operations for this example.



##########
examples/notebooks/get-started/learn_beam_windowing_by_doing.ipynb:
##########
@@ -0,0 +1,1880 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "view-in-github"
+      },
+      "source": [
+        "<a 
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/learn_beam_windowing_by_doing.ipynb\";
 target=\"_parent\"><img 
src=\"https://colab.research.google.com/assets/colab-badge.svg\"; alt=\"Open In 
Colab\"/></a>"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), 
Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n";,
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "cellView": "form",
+        "id": "L7ZbRufePd2g"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        " # **Introduction to Windowing in Apache Beam**\n",

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] liferoad commented on a diff in pull request #26185: add one windowing example

Reply via email to