davidcavazos commented on a change in pull request #15470:
URL: https://github.com/apache/beam/pull/15470#discussion_r704522951
##########
File path: examples/notebooks/tour-of-beam/dataframes.ipynb
##########
@@ -0,0 +1,820 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Beam DataFrames",
+ "provenance": [],
+ "collapsed_sections": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "metadata": {
+ "cellView": "form",
+ "id": "rz2qIC9IL2rI"
+ },
+ "source": [
+ "#@title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hDuXLLSZnI1D"
+ },
+ "source": [
+ "# Beam DataFrames\n",
+ "\n",
+ "<button>\n",
+ " <a
href=\"https://beam.apache.org/documentation/dsls/dataframes/overview/\">\n",
+ " <img src=\"https://beam.apache.org/images/favicon.ico\"
alt=\"Open the docs\" height=\"16\"/>\n",
+ " Beam DataFrames overview\n",
+ " </a>\n",
+ "</button>\n",
+ "\n",
+ "Beam DataFrames provide a pandas-like
[DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)\n",
+ "API to declare Beam pipelines.\n",
+ "\n",
+ "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+ "[Beam DataFrames
overview](https://beam.apache.org/documentation/dsls/dataframes/overview)
page.\n",
+ "\n",
+ "First, we need to install Apache Beam with the `interactive` extra
for the Interactive runner."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8QVByaWjkarZ"
+ },
+ "source": [
+ "%pip install --quiet apache-beam[interactive]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aLqdbX4Mgipq"
+ },
+ "source": [
+ "Lets create a small data file of\n",
+ "[Comma-Separated Values
(CSV)](https://en.wikipedia.org/wiki/Comma-separated_values).\n",
+ "It simply includes the dates of the\n",
+ "[equinoxes](https://en.wikipedia.org/wiki/Equinox) and\n",
+ "[solstices](https://en.wikipedia.org/wiki/Solstice)\n",
+ "of the year 2021."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hZjwAm7qotrJ"
+ },
+ "source": [
+ "%%writefile solar_events.csv\n",
+ "timestamp,event\n",
+ "2021-03-20 09:37:00,March Equinox\n",
+ "2021-06-21 03:32:00,June Solstice\n",
+ "2021-09-22 19:21:00,September Equinox\n",
+ "2021-12-21 15:59:00,December Solstice"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Hv_58JulleQ_"
+ },
+ "source": [
+ "# Interactive Beam\n",
+ "\n",
+ "Pandas has the\n",
+
"[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+ "function to easily read CSV files into DataFrames.\n",
+ "Beam has the\n",
+
"[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+ "function that emulates `pandas.read_csv`, but returns us a deferred
Beam DataFrame.\n",
+ "\n",
+ "If you’re using\n",
+ "[Interactive
Beam](https://beam.apache.org/releases/pydoc/current/apache_beam.runners.interactive.interactive_beam.html),\n",
+ "you can use `collect` to bring a Beam DataFrame into local memory as
a Pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 242
+ },
+ "id": "sKAMXD5ElhYP",
+ "outputId": "928d9ad7-ae75-42d7-8dc6-8c5afd730b11"
+ },
+ "source": [
+ "import apache_beam as beam\n",
+ "import apache_beam.runners.interactive.interactive_beam as ib\n",
+ "from apache_beam.runners.interactive.interactive_runner import
InteractiveRunner\n",
+ "\n",
+ "pipeline = beam.Pipeline(InteractiveRunner())\n",
+ "\n",
+ "# Create a deferred Beam DataFrame with the contents of our csv
file.\n",
+ "beam_df = pipeline | 'Read CSV' >>
beam.dataframe.io.read_csv('solar_events.csv')\n",
+ "\n",
+ "# We can use `ib.collect` to view the contents of a Beam
DataFrame.\n",
+ "ib.collect(beam_df)"
+ ],
+ "execution_count": 3,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "application/javascript": [
+ "\n",
+ " if (typeof window.interactive_beam_jquery ==
'undefined') {\n",
+ " var jqueryScript =
document.createElement('script');\n",
+ " jqueryScript.src =
'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+ " jqueryScript.type = 'text/javascript';\n",
+ " jqueryScript.onload = function() {\n",
+ " var datatableScript =
document.createElement('script');\n",
+ " datatableScript.src =
'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+ " datatableScript.type = 'text/javascript';\n",
+ " datatableScript.onload = function() {\n",
+ " window.interactive_beam_jquery =
jQuery.noConflict(true);\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ " });\n",
+ " }\n",
+ " document.head.appendChild(datatableScript);\n",
+ " };\n",
+ " document.head.appendChild(jqueryScript);\n",
+ " } else {\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ " });\n",
+ " }"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/html": [
+ "\n",
+ " <link rel=\"stylesheet\"
href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\"
integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\"
crossorigin=\"anonymous\">\n",
+ " <div
id=\"progress_indicator_1516f4062e4fc6d4e58f33cf44c41c1d\"
class=\"spinner-border text-info\" role=\"status\">\n",
+ " </div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:root:Make sure that locally built Python SDK docker image
has Python 3.7 interpreter.\n"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "application/javascript": [
+ "\n",
+ " if (typeof window.interactive_beam_jquery ==
'undefined') {\n",
+ " var jqueryScript =
document.createElement('script');\n",
+ " jqueryScript.src =
'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+ " jqueryScript.type = 'text/javascript';\n",
+ " jqueryScript.onload = function() {\n",
+ " var datatableScript =
document.createElement('script');\n",
+ " datatableScript.src =
'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+ " datatableScript.type = 'text/javascript';\n",
+ " datatableScript.onload = function() {\n",
+ " window.interactive_beam_jquery =
jQuery.noConflict(true);\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ "
$(\"#progress_indicator_1516f4062e4fc6d4e58f33cf44c41c1d\").remove();\n",
+ " });\n",
+ " }\n",
+ " document.head.appendChild(datatableScript);\n",
+ " };\n",
+ " document.head.appendChild(jqueryScript);\n",
+ " } else {\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ "
$(\"#progress_indicator_1516f4062e4fc6d4e58f33cf44c41c1d\").remove();\n",
+ " });\n",
+ " }"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>timestamp</th>\n",
+ " <th>event</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:0</th>\n",
+ " <td>2021-03-20 09:37:00</td>\n",
+ " <td>March Equinox</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:1</th>\n",
+ " <td>2021-06-21 03:32:00</td>\n",
+ " <td>June Solstice</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:2</th>\n",
+ " <td>2021-09-22 19:21:00</td>\n",
+ " <td>September Equinox</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:3</th>\n",
+ " <td>2021-12-21 15:59:00</td>\n",
+ " <td>December Solstice</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " timestamp event\n",
+ "solar_events.csv:0 2021-03-20 09:37:00 March Equinox\n",
+ "solar_events.csv:1 2021-06-21 03:32:00 June Solstice\n",
+ "solar_events.csv:2 2021-09-22 19:21:00 September Equinox\n",
+ "solar_events.csv:3 2021-12-21 15:59:00 December Solstice"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 3
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "t3Is6dArtN_Z"
+ },
+ "source": [
+ "Collecting a Beam DataFrame into a Pandas DataFrame is useful to
perform\n",
+ "[operations not supported by Beam
DataFrames](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas#classes-of-unsupported-operations).\n",
+ "\n",
+ "For example, lets say we want to take only the first two events in
chronological order.\n",
Review comment:
Good catch! Done
##########
File path: examples/notebooks/tour-of-beam/dataframes.ipynb
##########
@@ -0,0 +1,820 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Beam DataFrames",
+ "provenance": [],
+ "collapsed_sections": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "metadata": {
+ "cellView": "form",
+ "id": "rz2qIC9IL2rI"
+ },
+ "source": [
+ "#@title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hDuXLLSZnI1D"
+ },
+ "source": [
+ "# Beam DataFrames\n",
+ "\n",
+ "<button>\n",
+ " <a
href=\"https://beam.apache.org/documentation/dsls/dataframes/overview/\">\n",
+ " <img src=\"https://beam.apache.org/images/favicon.ico\"
alt=\"Open the docs\" height=\"16\"/>\n",
+ " Beam DataFrames overview\n",
+ " </a>\n",
+ "</button>\n",
+ "\n",
+ "Beam DataFrames provide a pandas-like
[DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)\n",
+ "API to declare Beam pipelines.\n",
+ "\n",
+ "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+ "[Beam DataFrames
overview](https://beam.apache.org/documentation/dsls/dataframes/overview)
page.\n",
+ "\n",
+ "First, we need to install Apache Beam with the `interactive` extra
for the Interactive runner."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8QVByaWjkarZ"
+ },
+ "source": [
+ "%pip install --quiet apache-beam[interactive]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aLqdbX4Mgipq"
+ },
+ "source": [
+ "Lets create a small data file of\n",
+ "[Comma-Separated Values
(CSV)](https://en.wikipedia.org/wiki/Comma-separated_values).\n",
+ "It simply includes the dates of the\n",
+ "[equinoxes](https://en.wikipedia.org/wiki/Equinox) and\n",
+ "[solstices](https://en.wikipedia.org/wiki/Solstice)\n",
+ "of the year 2021."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hZjwAm7qotrJ"
+ },
+ "source": [
+ "%%writefile solar_events.csv\n",
+ "timestamp,event\n",
+ "2021-03-20 09:37:00,March Equinox\n",
+ "2021-06-21 03:32:00,June Solstice\n",
+ "2021-09-22 19:21:00,September Equinox\n",
+ "2021-12-21 15:59:00,December Solstice"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Hv_58JulleQ_"
+ },
+ "source": [
+ "# Interactive Beam\n",
+ "\n",
+ "Pandas has the\n",
+
"[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+ "function to easily read CSV files into DataFrames.\n",
+ "Beam has the\n",
+
"[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+ "function that emulates `pandas.read_csv`, but returns us a deferred
Beam DataFrame.\n",
Review comment:
Thanks, done
##########
File path: examples/notebooks/tour-of-beam/dataframes.ipynb
##########
@@ -0,0 +1,820 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Beam DataFrames",
+ "provenance": [],
+ "collapsed_sections": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "metadata": {
+ "cellView": "form",
+ "id": "rz2qIC9IL2rI"
+ },
+ "source": [
+ "#@title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hDuXLLSZnI1D"
+ },
+ "source": [
+ "# Beam DataFrames\n",
+ "\n",
+ "<button>\n",
+ " <a
href=\"https://beam.apache.org/documentation/dsls/dataframes/overview/\">\n",
+ " <img src=\"https://beam.apache.org/images/favicon.ico\"
alt=\"Open the docs\" height=\"16\"/>\n",
+ " Beam DataFrames overview\n",
+ " </a>\n",
+ "</button>\n",
+ "\n",
+ "Beam DataFrames provide a pandas-like
[DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)\n",
+ "API to declare Beam pipelines.\n",
+ "\n",
+ "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+ "[Beam DataFrames
overview](https://beam.apache.org/documentation/dsls/dataframes/overview)
page.\n",
+ "\n",
+ "First, we need to install Apache Beam with the `interactive` extra
for the Interactive runner."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8QVByaWjkarZ"
+ },
+ "source": [
+ "%pip install --quiet apache-beam[interactive]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aLqdbX4Mgipq"
+ },
+ "source": [
+ "Lets create a small data file of\n",
+ "[Comma-Separated Values
(CSV)](https://en.wikipedia.org/wiki/Comma-separated_values).\n",
+ "It simply includes the dates of the\n",
+ "[equinoxes](https://en.wikipedia.org/wiki/Equinox) and\n",
+ "[solstices](https://en.wikipedia.org/wiki/Solstice)\n",
+ "of the year 2021."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hZjwAm7qotrJ"
+ },
+ "source": [
+ "%%writefile solar_events.csv\n",
+ "timestamp,event\n",
+ "2021-03-20 09:37:00,March Equinox\n",
+ "2021-06-21 03:32:00,June Solstice\n",
+ "2021-09-22 19:21:00,September Equinox\n",
+ "2021-12-21 15:59:00,December Solstice"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Hv_58JulleQ_"
+ },
+ "source": [
+ "# Interactive Beam\n",
+ "\n",
+ "Pandas has the\n",
+
"[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+ "function to easily read CSV files into DataFrames.\n",
+ "Beam has the\n",
+
"[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+ "function that emulates `pandas.read_csv`, but returns us a deferred
Beam DataFrame.\n",
+ "\n",
+ "If you’re using\n",
+ "[Interactive
Beam](https://beam.apache.org/releases/pydoc/current/apache_beam.runners.interactive.interactive_beam.html),\n",
+ "you can use `collect` to bring a Beam DataFrame into local memory as
a Pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 242
+ },
+ "id": "sKAMXD5ElhYP",
+ "outputId": "928d9ad7-ae75-42d7-8dc6-8c5afd730b11"
+ },
+ "source": [
+ "import apache_beam as beam\n",
+ "import apache_beam.runners.interactive.interactive_beam as ib\n",
+ "from apache_beam.runners.interactive.interactive_runner import
InteractiveRunner\n",
+ "\n",
+ "pipeline = beam.Pipeline(InteractiveRunner())\n",
+ "\n",
+ "# Create a deferred Beam DataFrame with the contents of our csv
file.\n",
+ "beam_df = pipeline | 'Read CSV' >>
beam.dataframe.io.read_csv('solar_events.csv')\n",
+ "\n",
+ "# We can use `ib.collect` to view the contents of a Beam
DataFrame.\n",
+ "ib.collect(beam_df)"
+ ],
+ "execution_count": 3,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "application/javascript": [
+ "\n",
+ " if (typeof window.interactive_beam_jquery ==
'undefined') {\n",
+ " var jqueryScript =
document.createElement('script');\n",
+ " jqueryScript.src =
'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+ " jqueryScript.type = 'text/javascript';\n",
+ " jqueryScript.onload = function() {\n",
+ " var datatableScript =
document.createElement('script');\n",
+ " datatableScript.src =
'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+ " datatableScript.type = 'text/javascript';\n",
+ " datatableScript.onload = function() {\n",
+ " window.interactive_beam_jquery =
jQuery.noConflict(true);\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ " });\n",
+ " }\n",
+ " document.head.appendChild(datatableScript);\n",
+ " };\n",
+ " document.head.appendChild(jqueryScript);\n",
+ " } else {\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ " });\n",
+ " }"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/html": [
+ "\n",
+ " <link rel=\"stylesheet\"
href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\"
integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\"
crossorigin=\"anonymous\">\n",
+ " <div
id=\"progress_indicator_1516f4062e4fc6d4e58f33cf44c41c1d\"
class=\"spinner-border text-info\" role=\"status\">\n",
+ " </div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:root:Make sure that locally built Python SDK docker image
has Python 3.7 interpreter.\n"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "application/javascript": [
+ "\n",
+ " if (typeof window.interactive_beam_jquery ==
'undefined') {\n",
+ " var jqueryScript =
document.createElement('script');\n",
+ " jqueryScript.src =
'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+ " jqueryScript.type = 'text/javascript';\n",
+ " jqueryScript.onload = function() {\n",
+ " var datatableScript =
document.createElement('script');\n",
+ " datatableScript.src =
'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+ " datatableScript.type = 'text/javascript';\n",
+ " datatableScript.onload = function() {\n",
+ " window.interactive_beam_jquery =
jQuery.noConflict(true);\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ "
$(\"#progress_indicator_1516f4062e4fc6d4e58f33cf44c41c1d\").remove();\n",
+ " });\n",
+ " }\n",
+ " document.head.appendChild(datatableScript);\n",
+ " };\n",
+ " document.head.appendChild(jqueryScript);\n",
+ " } else {\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ "
$(\"#progress_indicator_1516f4062e4fc6d4e58f33cf44c41c1d\").remove();\n",
+ " });\n",
+ " }"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>timestamp</th>\n",
+ " <th>event</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:0</th>\n",
+ " <td>2021-03-20 09:37:00</td>\n",
+ " <td>March Equinox</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:1</th>\n",
+ " <td>2021-06-21 03:32:00</td>\n",
+ " <td>June Solstice</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:2</th>\n",
+ " <td>2021-09-22 19:21:00</td>\n",
+ " <td>September Equinox</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:3</th>\n",
+ " <td>2021-12-21 15:59:00</td>\n",
+ " <td>December Solstice</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " timestamp event\n",
+ "solar_events.csv:0 2021-03-20 09:37:00 March Equinox\n",
+ "solar_events.csv:1 2021-06-21 03:32:00 June Solstice\n",
+ "solar_events.csv:2 2021-09-22 19:21:00 September Equinox\n",
+ "solar_events.csv:3 2021-12-21 15:59:00 December Solstice"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 3
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "t3Is6dArtN_Z"
+ },
+ "source": [
+ "Collecting a Beam DataFrame into a Pandas DataFrame is useful to
perform\n",
+ "[operations not supported by Beam
DataFrames](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas#classes-of-unsupported-operations).\n",
+ "\n",
+ "For example, lets say we want to take only the first two events in
chronological order.\n",
+ "Since a deferred Beam DataFrame does not have any ordering
guarantees,\n",
+ "first we need to sort the values.\n",
+ "In Pandas, we could first\n",
+
"[`df.sort_values(by='timestamp')`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)
and then\n",
+
"[`df.head(2)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)
to achieve this.\n",
+ "\n",
+ "However, these are\n",
+ "[order-sensitive
operations](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas#order-sensitive-operations)\n",
+ "so using them in a Beam DataFrame raises a\n",
+
"[`WontImplementError`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).\n",
+ "We can work around this by using `collect` to convert the Beam
DataFrame into a Pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 138
+ },
+ "id": "8haEu6_9iTi7",
+ "outputId": "a1e07bdc-c66d-45e5-efff-90b93219c648"
+ },
+ "source": [
+ "import apache_beam.runners.interactive.interactive_beam as ib\n",
+ "\n",
+ "# Collect the Beam DataFrame into a Pandas DataFrame.\n",
+ "df = ib.collect(beam_df)\n",
+ "\n",
+ "# We can now use any Pandas transforms with our data.\n",
+ "df.sort_values(by='timestamp').head(2)"
+ ],
+ "execution_count": 4,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/html": [
+ "\n",
+ " <link rel=\"stylesheet\"
href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\"
integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\"
crossorigin=\"anonymous\">\n",
+ " <div
id=\"progress_indicator_4486e01c01f75e7a68a4a5fefa9ecd2c\"
class=\"spinner-border text-info\" role=\"status\">\n",
+ " </div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "application/javascript": [
+ "\n",
+ " if (typeof window.interactive_beam_jquery ==
'undefined') {\n",
+ " var jqueryScript =
document.createElement('script');\n",
+ " jqueryScript.src =
'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+ " jqueryScript.type = 'text/javascript';\n",
+ " jqueryScript.onload = function() {\n",
+ " var datatableScript =
document.createElement('script');\n",
+ " datatableScript.src =
'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+ " datatableScript.type = 'text/javascript';\n",
+ " datatableScript.onload = function() {\n",
+ " window.interactive_beam_jquery =
jQuery.noConflict(true);\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ "
$(\"#progress_indicator_4486e01c01f75e7a68a4a5fefa9ecd2c\").remove();\n",
+ " });\n",
+ " }\n",
+ " document.head.appendChild(datatableScript);\n",
+ " };\n",
+ " document.head.appendChild(jqueryScript);\n",
+ " } else {\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ "
$(\"#progress_indicator_4486e01c01f75e7a68a4a5fefa9ecd2c\").remove();\n",
+ " });\n",
+ " }"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>timestamp</th>\n",
+ " <th>event</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:0</th>\n",
+ " <td>2021-03-20 09:37:00</td>\n",
+ " <td>March Equinox</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:1</th>\n",
+ " <td>2021-06-21 03:32:00</td>\n",
+ " <td>June Solstice</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " timestamp event\n",
+ "solar_events.csv:0 2021-03-20 09:37:00 March Equinox\n",
+ "solar_events.csv:1 2021-06-21 03:32:00 June Solstice"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 4
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZkthQ13pwpm0"
+ },
+ "source": [
+ "> ℹ️ Note that `collect` is _only_ accessible if you’re using\n",
+ "[Interactive
Beam](https://beam.apache.org/releases/pydoc/current/apache_beam.runners.interactive.interactive_beam.html)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ujRm4K0iP8SX"
+ },
+ "source": [
+ "# Beam DataFrames to PCollections\n",
+ "\n",
+ "If you have your data as a Beam DataFrame, you can convert it into a
regular PCollection with\n",
+
"[`to_pcollection`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection).\n",
+ "\n",
+ "Converting a Beam DataFrame into a PCollection gives us each element
as a\n",
+
"[`namedtuple`](https://docs.python.org/3/library/collections.html#collections.namedtuple)\n",
+ "instance.\n",
+ "This allows us to easily access each property, for example
`element.event` and `element.timestamp`.\n",
+ "\n",
+ "Sometimes it's more convenient to convert the named tuples to Python
dictionaries,\n",
Review comment:
Thank you, I greatly appreciate this, I'm always looking forward to
improving my writing :)
##########
File path: examples/notebooks/tour-of-beam/dataframes.ipynb
##########
@@ -0,0 +1,820 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Beam DataFrames",
+ "provenance": [],
+ "collapsed_sections": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "code",
+ "metadata": {
+ "cellView": "form",
+ "id": "rz2qIC9IL2rI"
+ },
+ "source": [
+ "#@title ###### Licensed to the Apache Software Foundation (ASF),
Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hDuXLLSZnI1D"
+ },
+ "source": [
+ "# Beam DataFrames\n",
+ "\n",
+ "<button>\n",
+ " <a
href=\"https://beam.apache.org/documentation/dsls/dataframes/overview/\">\n",
+ " <img src=\"https://beam.apache.org/images/favicon.ico\"
alt=\"Open the docs\" height=\"16\"/>\n",
+ " Beam DataFrames overview\n",
+ " </a>\n",
+ "</button>\n",
+ "\n",
+ "Beam DataFrames provide a pandas-like
[DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)\n",
+ "API to declare Beam pipelines.\n",
+ "\n",
+ "> ℹ️ To learn more about Beam DataFrames, take a look at the\n",
+ "[Beam DataFrames
overview](https://beam.apache.org/documentation/dsls/dataframes/overview)
page.\n",
+ "\n",
+ "First, we need to install Apache Beam with the `interactive` extra
for the Interactive runner."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "8QVByaWjkarZ"
+ },
+ "source": [
+ "%pip install --quiet apache-beam[interactive]"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aLqdbX4Mgipq"
+ },
+ "source": [
+ "Lets create a small data file of\n",
+ "[Comma-Separated Values
(CSV)](https://en.wikipedia.org/wiki/Comma-separated_values).\n",
+ "It simply includes the dates of the\n",
+ "[equinoxes](https://en.wikipedia.org/wiki/Equinox) and\n",
+ "[solstices](https://en.wikipedia.org/wiki/Solstice)\n",
+ "of the year 2021."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "hZjwAm7qotrJ"
+ },
+ "source": [
+ "%%writefile solar_events.csv\n",
+ "timestamp,event\n",
+ "2021-03-20 09:37:00,March Equinox\n",
+ "2021-06-21 03:32:00,June Solstice\n",
+ "2021-09-22 19:21:00,September Equinox\n",
+ "2021-12-21 15:59:00,December Solstice"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Hv_58JulleQ_"
+ },
+ "source": [
+ "# Interactive Beam\n",
+ "\n",
+ "Pandas has the\n",
+
"[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n",
+ "function to easily read CSV files into DataFrames.\n",
+ "Beam has the\n",
+
"[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n",
+ "function that emulates `pandas.read_csv`, but returns us a deferred
Beam DataFrame.\n",
+ "\n",
+ "If you’re using\n",
+ "[Interactive
Beam](https://beam.apache.org/releases/pydoc/current/apache_beam.runners.interactive.interactive_beam.html),\n",
+ "you can use `collect` to bring a Beam DataFrame into local memory as
a Pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 242
+ },
+ "id": "sKAMXD5ElhYP",
+ "outputId": "928d9ad7-ae75-42d7-8dc6-8c5afd730b11"
+ },
+ "source": [
+ "import apache_beam as beam\n",
+ "import apache_beam.runners.interactive.interactive_beam as ib\n",
+ "from apache_beam.runners.interactive.interactive_runner import
InteractiveRunner\n",
+ "\n",
+ "pipeline = beam.Pipeline(InteractiveRunner())\n",
+ "\n",
+ "# Create a deferred Beam DataFrame with the contents of our csv
file.\n",
+ "beam_df = pipeline | 'Read CSV' >>
beam.dataframe.io.read_csv('solar_events.csv')\n",
+ "\n",
+ "# We can use `ib.collect` to view the contents of a Beam
DataFrame.\n",
+ "ib.collect(beam_df)"
+ ],
+ "execution_count": 3,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "application/javascript": [
+ "\n",
+ " if (typeof window.interactive_beam_jquery ==
'undefined') {\n",
+ " var jqueryScript =
document.createElement('script');\n",
+ " jqueryScript.src =
'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+ " jqueryScript.type = 'text/javascript';\n",
+ " jqueryScript.onload = function() {\n",
+ " var datatableScript =
document.createElement('script');\n",
+ " datatableScript.src =
'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+ " datatableScript.type = 'text/javascript';\n",
+ " datatableScript.onload = function() {\n",
+ " window.interactive_beam_jquery =
jQuery.noConflict(true);\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ " });\n",
+ " }\n",
+ " document.head.appendChild(datatableScript);\n",
+ " };\n",
+ " document.head.appendChild(jqueryScript);\n",
+ " } else {\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ " });\n",
+ " }"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/html": [
+ "\n",
+ " <link rel=\"stylesheet\"
href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\"
integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\"
crossorigin=\"anonymous\">\n",
+ " <div
id=\"progress_indicator_1516f4062e4fc6d4e58f33cf44c41c1d\"
class=\"spinner-border text-info\" role=\"status\">\n",
+ " </div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:root:Make sure that locally built Python SDK docker image
has Python 3.7 interpreter.\n"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "application/javascript": [
+ "\n",
+ " if (typeof window.interactive_beam_jquery ==
'undefined') {\n",
+ " var jqueryScript =
document.createElement('script');\n",
+ " jqueryScript.src =
'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+ " jqueryScript.type = 'text/javascript';\n",
+ " jqueryScript.onload = function() {\n",
+ " var datatableScript =
document.createElement('script');\n",
+ " datatableScript.src =
'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+ " datatableScript.type = 'text/javascript';\n",
+ " datatableScript.onload = function() {\n",
+ " window.interactive_beam_jquery =
jQuery.noConflict(true);\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ "
$(\"#progress_indicator_1516f4062e4fc6d4e58f33cf44c41c1d\").remove();\n",
+ " });\n",
+ " }\n",
+ " document.head.appendChild(datatableScript);\n",
+ " };\n",
+ " document.head.appendChild(jqueryScript);\n",
+ " } else {\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ "
$(\"#progress_indicator_1516f4062e4fc6d4e58f33cf44c41c1d\").remove();\n",
+ " });\n",
+ " }"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>timestamp</th>\n",
+ " <th>event</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:0</th>\n",
+ " <td>2021-03-20 09:37:00</td>\n",
+ " <td>March Equinox</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:1</th>\n",
+ " <td>2021-06-21 03:32:00</td>\n",
+ " <td>June Solstice</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:2</th>\n",
+ " <td>2021-09-22 19:21:00</td>\n",
+ " <td>September Equinox</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:3</th>\n",
+ " <td>2021-12-21 15:59:00</td>\n",
+ " <td>December Solstice</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " timestamp event\n",
+ "solar_events.csv:0 2021-03-20 09:37:00 March Equinox\n",
+ "solar_events.csv:1 2021-06-21 03:32:00 June Solstice\n",
+ "solar_events.csv:2 2021-09-22 19:21:00 September Equinox\n",
+ "solar_events.csv:3 2021-12-21 15:59:00 December Solstice"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 3
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "t3Is6dArtN_Z"
+ },
+ "source": [
+ "Collecting a Beam DataFrame into a Pandas DataFrame is useful to
perform\n",
+ "[operations not supported by Beam
DataFrames](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas#classes-of-unsupported-operations).\n",
+ "\n",
+ "For example, lets say we want to take only the first two events in
chronological order.\n",
+ "Since a deferred Beam DataFrame does not have any ordering
guarantees,\n",
+ "first we need to sort the values.\n",
+ "In Pandas, we could first\n",
+
"[`df.sort_values(by='timestamp')`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)
and then\n",
+
"[`df.head(2)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)
to achieve this.\n",
+ "\n",
+ "However, these are\n",
+ "[order-sensitive
operations](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas#order-sensitive-operations)\n",
+ "so using them in a Beam DataFrame raises a\n",
+
"[`WontImplementError`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).\n",
+ "We can work around this by using `collect` to convert the Beam
DataFrame into a Pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 138
+ },
+ "id": "8haEu6_9iTi7",
+ "outputId": "a1e07bdc-c66d-45e5-efff-90b93219c648"
+ },
+ "source": [
+ "import apache_beam.runners.interactive.interactive_beam as ib\n",
+ "\n",
+ "# Collect the Beam DataFrame into a Pandas DataFrame.\n",
+ "df = ib.collect(beam_df)\n",
+ "\n",
+ "# We can now use any Pandas transforms with our data.\n",
+ "df.sort_values(by='timestamp').head(2)"
+ ],
+ "execution_count": 4,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/html": [
+ "\n",
+ " <link rel=\"stylesheet\"
href=\"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css\"
integrity=\"sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh\"
crossorigin=\"anonymous\">\n",
+ " <div
id=\"progress_indicator_4486e01c01f75e7a68a4a5fefa9ecd2c\"
class=\"spinner-border text-info\" role=\"status\">\n",
+ " </div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "application/javascript": [
+ "\n",
+ " if (typeof window.interactive_beam_jquery ==
'undefined') {\n",
+ " var jqueryScript =
document.createElement('script');\n",
+ " jqueryScript.src =
'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n",
+ " jqueryScript.type = 'text/javascript';\n",
+ " jqueryScript.onload = function() {\n",
+ " var datatableScript =
document.createElement('script');\n",
+ " datatableScript.src =
'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n",
+ " datatableScript.type = 'text/javascript';\n",
+ " datatableScript.onload = function() {\n",
+ " window.interactive_beam_jquery =
jQuery.noConflict(true);\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ "
$(\"#progress_indicator_4486e01c01f75e7a68a4a5fefa9ecd2c\").remove();\n",
+ " });\n",
+ " }\n",
+ " document.head.appendChild(datatableScript);\n",
+ " };\n",
+ " document.head.appendChild(jqueryScript);\n",
+ " } else {\n",
+ "
window.interactive_beam_jquery(document).ready(function($){\n",
+ " \n",
+ "
$(\"#progress_indicator_4486e01c01f75e7a68a4a5fefa9ecd2c\").remove();\n",
+ " });\n",
+ " }"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>timestamp</th>\n",
+ " <th>event</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:0</th>\n",
+ " <td>2021-03-20 09:37:00</td>\n",
+ " <td>March Equinox</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>solar_events.csv:1</th>\n",
+ " <td>2021-06-21 03:32:00</td>\n",
+ " <td>June Solstice</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " timestamp event\n",
+ "solar_events.csv:0 2021-03-20 09:37:00 March Equinox\n",
+ "solar_events.csv:1 2021-06-21 03:32:00 June Solstice"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 4
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZkthQ13pwpm0"
+ },
+ "source": [
+ "> ℹ️ Note that `collect` is _only_ accessible if you’re using\n",
+ "[Interactive
Beam](https://beam.apache.org/releases/pydoc/current/apache_beam.runners.interactive.interactive_beam.html)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ujRm4K0iP8SX"
+ },
+ "source": [
+ "# Beam DataFrames to PCollections\n",
+ "\n",
+ "If you have your data as a Beam DataFrame, you can convert it into a
regular PCollection with\n",
+
"[`to_pcollection`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection).\n",
+ "\n",
+ "Converting a Beam DataFrame into a PCollection gives us each element
as a\n",
+
"[`namedtuple`](https://docs.python.org/3/library/collections.html#collections.namedtuple)\n",
+ "instance.\n",
+ "This allows us to easily access each property, for example
`element.event` and `element.timestamp`.\n",
+ "\n",
+ "Sometimes it's more convenient to convert the named tuples to Python
dictionaries,\n",
+ "we can do that with the\n",
+
"[`_asdict`](https://docs.python.org/3/library/collections.html#collections.somenamedtuple._asdict)\n",
+ "method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "g22op8rZPvB3",
+ "outputId": "bba88b0b-4d19-4d61-dac7-2c168998a2e4"
+ },
+ "source": [
+ "import apache_beam as beam\n",
+ "from apache_beam.dataframe import convert\n",
+ "\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " beam_df = pipeline | 'Read CSV' >>
beam.dataframe.io.read_csv('solar_events.csv')\n",
+ "\n",
+ " (\n",
+ " # Convert the Beam DataFrame to a PCollection.\n",
+ " convert.to_pcollection(beam_df)\n",
+ "\n",
+ " # We get named tuples, we can convert them to dictionaries like
this.\n",
+ " | 'To dictionaries' >> beam.Map(lambda x: dict(x._asdict()))\n",
+ "\n",
+ " # Print the elements in the PCollection.\n",
+ " | 'Print' >> beam.Map(print)\n",
+ " )"
+ ],
+ "execution_count": 5,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:root:Make sure that locally built Python SDK docker image
has Python 3.7 interpreter.\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "{'timestamp': '2021-03-20 09:37:00', 'event': 'March Equinox'}\n",
+ "{'timestamp': '2021-06-21 03:32:00', 'event': 'June Solstice'}\n",
+ "{'timestamp': '2021-09-22 19:21:00', 'event': 'September
Equinox'}\n",
+ "{'timestamp': '2021-12-21 15:59:00', 'event': 'December
Solstice'}\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "t6xNIO0iPwtn"
+ },
+ "source": [
+ "# Pandas DataFrames to PCollections\n",
+ "\n",
+ "If you have your data as a Pandas DataFrame, you can convert it into
a regular PCollection with\n",
+
"[`to_pcollection`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection).\n",
+ "\n",
+ "Since Pandas DataFrames are not part of any Beam pipeline, we must
provide the `pipeline` explicitly."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "YWYVFkvFuksz",
+ "outputId": "a3e3e6fa-85ce-4891-95a0-389fba4461a6"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "import apache_beam as beam\n",
+ "from apache_beam.dataframe import convert\n",
+ "\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " df = pd.read_csv('solar_events.csv')\n",
+ "\n",
+ " (\n",
+ " # Convert the Pandas DataFrame to a PCollection.\n",
+ " convert.to_pcollection(df, pipeline=pipeline)\n",
+ "\n",
+ " # We get named tuples, we can convert them to dictionaries like
this.\n",
+ " | 'To dictionaries' >> beam.Map(lambda x: dict(x._asdict()))\n",
+ "\n",
+ " # Print the elements in the PCollection.\n",
+ " | 'Print' >> beam.Map(print)\n",
+ " )"
+ ],
+ "execution_count": 6,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:root:Make sure that locally built Python SDK docker image
has Python 3.7 interpreter.\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "{'timestamp': '2021-03-20 09:37:00', 'event': 'March Equinox'}\n",
+ "{'timestamp': '2021-06-21 03:32:00', 'event': 'June Solstice'}\n",
+ "{'timestamp': '2021-09-22 19:21:00', 'event': 'September
Equinox'}\n",
+ "{'timestamp': '2021-12-21 15:59:00', 'event': 'December
Solstice'}\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "z6Q_tyWszkMC"
+ },
+ "source": [
+ "If you have your data as a PCollection of Pandas DataFrames, you can
convert them into a PCollection with\n",
+
"[`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap).\n",
+ "\n",
+ "> ℹ️ If the number of elements in each DataFrame can be very
different, that is that some DataFrames contain thousands of elements while
others only contain a handful of elements, it might be a good idea to\n",
Review comment:
Thanks, done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]