damccorm commented on code in PR #23497: URL: https://github.com/apache/beam/pull/23497#discussion_r989085361
########## website/www/site/content/en/documentation/ml/anomaly-detection.md: ##########
@@ -0,0 +1,238 @@
+---
+title: "Anomaly Detection"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Anomaly Detection Example
+
+The AnomalyDetection example demonstrates how to set up an anomaly detection pipeline that reads text from PubSub in real time and then detects anomalies using a trained HDBSCAN clustering model.
+
+### Dataset for Anomaly Detection
+For this example, we use a dataset called [emotion](https://huggingface.co/datasets/emotion). It comprises 20,000 English Twitter messages labeled with 6 basic emotions: anger, fear, joy, love, sadness, and surprise. The dataset has three splits: train (for training), validation, and test (for performance evaluation). It is a supervised dataset because each entry contains both the text and its category (class). The dataset can easily be accessed using [HuggingFace Datasets](https://huggingface.co/docs/datasets/index).
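Each record in this dataset pairs a tweet's text with an integer class id. A minimal sketch of that record shape, where the id-to-name mapping is an assumption for illustration (check the dataset card for the authoritative label order):

```python
# Hypothetical id-to-name mapping for the emotion dataset's class labels.
ID_TO_EMOTION = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# A record pairs the raw tweet text with its integer label.
record = {"text": "on a boat trip to denmark", "label": 1}
emotion = ID_TO_EMOTION[record["label"]]
print(emotion)
```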
+
+To get a better understanding of the dataset, here are some examples from the train split:
+
+| Text | Type of emotion |
+| :--- | :----: |
+| im grabbing a minute to post i feel greedy wrong | Anger |
+| i am ever feeling nostalgic about the fireplace i will know that it is still on the property | Love |
+| ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny | Fear |
+| on a boat trip to denmark | Joy |
+| i feel you know basically like a fake in the realm of science fiction | Sadness |
+| i began having them several times a week feeling tortured by the hallucinations moving people and figures sounds and vibrations | Fear |
+
+### Anomaly Detection Algorithm
+[HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) is a clustering algorithm that extends DBSCAN by converting it into a hierarchical clustering algorithm and then extracting a flat clustering based on the stability of clusters. Once trained, the model predicts -1 if a new data point is an outlier; otherwise it predicts one of the existing clusters.
+
+## Ingestion to PubSub
+We first ingest the data into [PubSub](https://cloud.google.com/pubsub/docs/overview) so that while clustering we can read the tweets from PubSub. PubSub is a messaging service for exchanging event data among applications and services. It is used in streaming analytics and data integration pipelines to ingest and distribute data.
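Before publishing, each tweet needs a unique identifier and a byte-encoded payload. A minimal standard-library sketch of that formatting step (the helper name and the `id` attribute key are illustrative, not taken from the example code):

```python
import uuid

def to_pubsub_payload(text: str) -> dict:
    # Attach a unique identifier (UID) and UTF-8 encode the text, matching
    # the data/attributes shape that Pub/Sub messages carry.
    return {"data": text.encode("utf-8"), "attributes": {"id": uuid.uuid4().hex}}

msg = to_pubsub_payload("im grabbing a minute to post i feel greedy wrong")
```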
+
+The full example code for ingesting data into PubSub can be found [here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/inference/anomaly_detection/write_data_to_pubsub_pipeline/).
+
+The file structure for the ingestion pipeline is:
+
+    write_data_to_pubsub_pipeline/
+    ├── pipeline/
+    │   ├── __init__.py
+    │   ├── options.py
+    │   └── utils.py
+    ├── __init__.py
+    ├── config.py
+    ├── main.py
+    └── setup.py
+
+`pipeline/utils.py` contains the code for loading the emotion dataset and two `beam.DoFn`s that are used for data transformation.
+
+`pipeline/options.py` contains the pipeline options to configure the Dataflow pipeline.
+
+`config.py` defines variables like the GCP PROJECT_ID and NUM_WORKERS that are used multiple times.
+
+`setup.py` defines the packages/requirements for the pipeline to run.
+
+`main.py` contains the pipeline code and some additional function used for running the pipeline
+
+### How to Run the Pipeline?
+First, make sure you have installed the required packages.
+
+1. Locally on your machine: `python main.py`
+2. On GCP for Dataflow: `python main.py --mode cloud`
+
+The `write_data_to_pubsub_pipeline` contains four different transforms:
+1. Load the emotion dataset using HuggingFace Datasets (for simplicity, we take samples from 3 of the 6 classes)
+2. Associate each piece of text with a unique identifier (UID)
+3. Convert the text into the format PubSub expects
+4. Write the formatted message to PubSub
+
+## Anomaly Detection on Streaming Data
+
+With the data ingested into PubSub, we can run the anomaly detection pipeline. This pipeline reads the streaming messages from PubSub, converts each text to an embedding using a language model, and feeds the embedding to an already trained clustering model to predict whether the message is an anomaly. One prerequisite for this pipeline is to have an HDBSCAN clustering model trained on the training split of the dataset.
+
+The full example code for anomaly detection can be found [here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/inference/anomaly_detection/anomaly_detection_pipeline/).
+
+The file structure for the anomaly_detection pipeline is:
+
+    anomaly_detection_pipeline/
+    ├── pipeline/
+    │   ├── __init__.py
+    │   ├── options.py
+    │   └── transformations.py
+    ├── __init__.py
+    ├── config.py
+    ├── main.py
+    └── setup.py
+
+`pipeline/transformations.py` contains the code for the different `beam.DoFn`s and additional functions that are used in the pipeline.
+
+`pipeline/options.py` contains the pipeline options to configure the Dataflow pipeline.
+
+`config.py` defines variables like the GCP PROJECT_ID and NUM_WORKERS that are used multiple times.
+
+`setup.py` defines the packages/requirements for the pipeline to run.
+
+`main.py` contains the pipeline code and some additional functions used for running the pipeline.
+
+### How to Run the Pipeline?
+First, make sure you have installed the required packages and have pushed data to PubSub.
+
+1. Locally on your machine: `python main.py`
+2. On GCP for Dataflow: `python main.py --mode cloud`
+
+The pipeline can be broken down into a few simple steps:
+
+1. Reading the message from PubSub
+2. Converting the PubSub message into a PCollection of dictionaries where the key is the UID and the value is the Twitter text
+3. Encoding the text into transformer-readable token ID integers using a tokenizer
+4. Using RunInference to get the vector embedding from a transformer-based language model
+5. Normalizing the embedding
+6. Using RunInference to get the anomaly prediction from a trained HDBSCAN clustering model
+7. Writing the prediction to BigQuery so that the clustering model can be retrained when needed
+8. 
Sending an email alert if an anomaly is detected
+
+The code snippet for the first two steps of the pipeline:
+
+{{< highlight >}}
+  docs = (
+      pipeline
+      | "Read from PubSub"
+      >> ReadFromPubSub(subscription=cfg.SUBSCRIPTION_ID, with_attributes=True)
+      | "Decode PubSubMessage" >> beam.ParDo(Decode())
+  )
+{{< /highlight >}}
+
+We will now focus on the important steps of the pipeline: tokenizing the text, getting the embedding using RunInference, and finally getting the prediction from the HDBSCAN model.
+
+### Getting Embedding from a Language Model
+
+In order to do clustering with text data, we first need to map the text into vectors of numerical values suitable for statistical analysis. We use a transformer-based language model called [sentence-transformers/stsb-distilbert-base](https://huggingface.co/sentence-transformers/stsb-distilbert-base). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. But we first need to tokenize the text, as the language model expects a tokenized input instead of raw text.
+
+Tokenization is a preprocessing task that transforms text in a way that it can be fed into the model for getting predictions.

Review Comment:
```suggestion
Tokenization is a preprocessing task that transforms text so that it can be fed into the model for getting predictions.
```
Sorry, one more wording nit I missed last time

########## website/www/site/content/en/documentation/ml/anomaly-detection.md: ##########
@@ -0,0 +1,238 @@
+---
+title: "Anomaly Detection"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
...
+`main.py` contains the pipeline code and some additional function used for running the pipeline

Review Comment:
```suggestion
`main.py` contains the pipeline code and some additional functions used for running the pipeline
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
