damccorm commented on code in PR #34459: URL: https://github.com/apache/beam/pull/34459#discussion_r2019014506
########## examples/notebooks/beam-ml/anomaly_detection/anomaly_detection_zscore.ipynb: ########## @@ -0,0 +1,653 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "<a href=\"https://colab.research.google.com/github/shunping/beam/blob/anomaly-detection-notebook-1/anomaly_detection_zscore.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" Review Comment: Please remove in favor of a cell below the title like https://github.com/apache/beam/blob/4ac20255c993906045bb720618324575b50689d7/examples/notebooks/beam-ml/automatic_model_refresh.ipynb#L45 (we do this across all our notebooks) ########## examples/notebooks/beam-ml/anomaly_detection/anomaly_detection_zscore.ipynb: ########## @@ -0,0 +1,653 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "<a href=\"https://colab.research.google.com/github/shunping/beam/blob/anomaly-detection-notebook-1/anomaly_detection_zscore.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d79fe3a-952b-478f-ba78-44cafddc91d1", + "metadata": { + "cellView": "form", + "id": "2d79fe3a-952b-478f-ba78-44cafddc91d1" + }, + "outputs": [], + "source": [ + "# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n", + "\n", + "# Licensed to the Apache Software Foundation (ASF) under one\n", + "# or more contributor license agreements. See the NOTICE file\n", + "# distributed with this work for additional information\n", + "# regarding copyright ownership. The ASF licenses this file\n", + "# to you under the Apache License, Version 2.0 (the\n", + "# \"License\"); you may not use this file except in compliance\n", + "# with the License. You may obtain a copy of the License at\n", + "#\n", + "# http://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing,\n", + "# software distributed under the License is distributed on an\n", + "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", + "# KIND, either express or implied. See the License for the\n", + "# specific language governing permissions and limitations\n", + "# under the License" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Use Apache Beam for Anomaly Detection on Streaming Data (Z-Score)\n", + "This notebook demonstrates how to perform anomaly detection on streaming data using the `AnomalyDetection` PTransform, a new feature introduced in Apache Beam 2.64.0.\n", + "\n", + "We will first generate a synthetic dataset that incorporates various types of concept drift (changes in the underlying data distribution). This data will then be published to a Pub/Sub topic to simulate a real-time input stream. Our Beam pipeline will read from this topic, apply the `AnomalyDetection` PTransform with the Z-Score algorithm, and publish the detected anomalies to a second Pub/Sub topic. Finally, we will visualize the labeled data points in an animated plot." + ], + "metadata": { + "id": "vOwDXurVLkxO" + }, + "id": "vOwDXurVLkxO" + }, + { + "cell_type": "markdown", + "source": [ + "## Preparation\n", + "To get started with this notebook, you'll need to install the Apache Beam Python SDK and its associated extras. Make sure your installation is version 2.64.0 or later." + ], + "metadata": { + "id": "UQx1jKZ2LHHR" + }, + "id": "UQx1jKZ2LHHR" + }, + { + "cell_type": "code", + "source": [ + "! pip install apache_beam[interactive,gcp]==2.64.0rc2 --quiet" + ], + "metadata": { + "collapsed": true, + "id": "SafqA6dALKvo" + }, + "id": "SafqA6dALKvo", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "To proceed, import the essential modules: matplotlib, numpy, pandas, Beam, and others as needed." + ], + "metadata": { + "id": "h5gJCJGpStSb" + }, + "id": "h5gJCJGpStSb" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8fb71376-b0eb-474b-ab51-2161dfa60e2d", + "metadata": { + "id": "8fb71376-b0eb-474b-ab51-2161dfa60e2d" + }, + "outputs": [], + "source": [ + "# Imports Required for the notebook\n", + "import json\n", + "import os\n", + "import random\n", + "import threading\n", + "import time\n", + "import warnings\n", + "from typing import Any\n", + "from typing import Iterable\n", + "from typing import Tuple\n", + "\n", + "import matplotlib.animation\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "from IPython.display import HTML, Javascript\n", + "from google.api_core import retry\n", + "from google.api_core.exceptions import AlreadyExists\n", + "from google.cloud import pubsub_v1\n", + "from google.cloud.exceptions import NotFound\n", + "\n", + "import apache_beam as beam\n", + "from apache_beam.io.gcp.pubsub import PubsubMessage\n", + "from apache_beam.ml.anomaly.base import AnomalyResult\n", + "from apache_beam.ml.anomaly.base import AnomalyPrediction\n", + "from apache_beam.ml.anomaly.detectors.zscore import ZScore\n", + "from apache_beam.ml.anomaly.transforms import AnomalyDetection\n", + "from apache_beam.ml.anomaly.univariate.mean import IncSlidingMeanTracker\n", + "from apache_beam.ml.anomaly.univariate.stdev import IncSlidingStdevTracker\n", + "from apache_beam.options.pipeline_options import PipelineOptions\n", + "\n", + "# Suppress logging warnings\n", + "os.environ[\"GRPC_VERBOSITY\"] = \"ERROR\"\n", + "os.environ[\"GLOG_minloglevel\"] = \"2\"\n", + "warnings.filterwarnings('ignore')" + ] + }, + { + "cell_type": "markdown", + "source": [ + " Next, replace `<PROJECT_ID>` with your Google Cloud project ID." + ], + "metadata": { + "id": "C2GhkkvSUXdu" + }, + "id": "C2GhkkvSUXdu" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29cda7f0-a24e-4e74-ba6e-166413ab532c", + "metadata": { + "id": "29cda7f0-a24e-4e74-ba6e-166413ab532c" + }, + "outputs": [], + "source": [ + "# GCP-related constant are listed below\n", + "\n", + "# GCP project id\n", + "PROJECT_ID = '<PROJECT_ID>' # @param {type:'string'}\n", + "\n", + "SUFFIX = str(random.randint(0, 10000))\n", + "\n", + "# Pubsub topic and subscription for retrieving input data\n", + "INPUT_TOPIC = 'anomaly-input-' + SUFFIX\n", + "INPUT_SUB = INPUT_TOPIC + '-sub'\n", + "\n", + "# Pubsub topic and subscription for collecting output result\n", + "OUTPUT_TOPIC = 'anomaly-output-' + SUFFIX\n", + "OUTPUT_SUB = OUTPUT_TOPIC + '-sub'" + ] + }, + { + "cell_type": "markdown", + "source": [ + "The last preparation step needs to authenticate your Google account and authorize your Colab notebook to access Google Cloud Platform (GCP) resources associated with the project set above." + ], + "metadata": { + "id": "coZF7R8zTBsF" + }, + "id": "coZF7R8zTBsF" + }, + { + "cell_type": "code", + "source": [ + "from google.colab import auth\n", + "auth.authenticate_user(project_id=PROJECT_ID)" + ], + "metadata": { + "id": "51jml7JvMpbD" + }, + "id": "51jml7JvMpbD", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "id": "0064575d-5e60-4f8b-a970-9dc39db8d331", + "metadata": { + "id": "0064575d-5e60-4f8b-a970-9dc39db8d331" + }, + "source": [ + "## Generating Synthetic Data with Concept Drifting" Review Comment: ```suggestion "## Generating Synthetic Data with Concept Drift" ``` Nit: I think this is the right tense/may help with SEO ########## examples/notebooks/beam-ml/anomaly_detection/anomaly_detection_zscore.ipynb: ########## @@ -0,0 +1,653 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "<a href=\"https://colab.research.google.com/github/shunping/beam/blob/anomaly-detection-notebook-1/anomaly_detection_zscore.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d79fe3a-952b-478f-ba78-44cafddc91d1", + "metadata": { + "cellView": "form", + "id": "2d79fe3a-952b-478f-ba78-44cafddc91d1" + }, + "outputs": [], + "source": [ + "# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n", + "\n", + "# Licensed to the Apache Software Foundation (ASF) under one\n", + "# or more contributor license agreements. See the NOTICE file\n", + "# distributed with this work for additional information\n", + "# regarding copyright ownership. The ASF licenses this file\n", + "# to you under the Apache License, Version 2.0 (the\n", + "# \"License\"); you may not use this file except in compliance\n", + "# with the License. You may obtain a copy of the License at\n", + "#\n", + "# http://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing,\n", + "# software distributed under the License is distributed on an\n", + "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", + "# KIND, either express or implied. See the License for the\n", + "# specific language governing permissions and limitations\n", + "# under the License" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Use Apache Beam for Anomaly Detection on Streaming Data (Z-Score)\n", + "This notebook demonstrates how to perform anomaly detection on streaming data using the `AnomalyDetection` PTransform, a new feature introduced in Apache Beam 2.64.0.\n", + "\n", + "We will first generate a synthetic dataset that incorporates various types of concept drift (changes in the underlying data distribution). This data will then be published to a Pub/Sub topic to simulate a real-time input stream. Our Beam pipeline will read from this topic, apply the `AnomalyDetection` PTransform with the Z-Score algorithm, and publish the detected anomalies to a second Pub/Sub topic. Finally, we will visualize the labeled data points in an animated plot." + ], + "metadata": { + "id": "vOwDXurVLkxO" + }, + "id": "vOwDXurVLkxO" + }, + { + "cell_type": "markdown", + "source": [ + "## Preparation\n", + "To get started with this notebook, you'll need to install the Apache Beam Python SDK and its associated extras. Make sure your installation is version 2.64.0 or later." + ], + "metadata": { + "id": "UQx1jKZ2LHHR" + }, + "id": "UQx1jKZ2LHHR" + }, + { + "cell_type": "code", + "source": [ + "! pip install apache_beam[interactive,gcp]==2.64.0rc2 --quiet" Review Comment: ```suggestion "! pip install apache_beam[interactive,gcp]>=2.64.0 --quiet" ``` Adding this so we don't forget -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org