damccorm commented on code in PR #23497: URL: https://github.com/apache/beam/pull/23497#discussion_r989085361
########## website/www/site/content/en/documentation/ml/anomaly-detection.md: ##########
@@ -0,0 +1,238 @@
+---
+title: "Anomaly Detection"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Anomaly Detection Example
+
+The AnomalyDetection example demonstrates how to set up an anomaly detection pipeline that reads text from PubSub in real time and then detects anomalies using a trained HDBSCAN clustering model.
+
+### Dataset for Anomaly Detection
+For this example, we use a dataset called [emotion](https://huggingface.co/datasets/emotion). It comprises 20,000 English Twitter messages labeled with 6 basic emotions: anger, fear, joy, love, sadness, and surprise. The dataset has three splits: train (for training), validation, and test (for performance evaluation). It is a supervised dataset because each entry contains both the text and its category (class). The dataset can easily be accessed using [HuggingFace Datasets](https://huggingface.co/docs/datasets/index).
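Each record in this dataset pairs a tweet's text with an integer class id. A minimal sketch of that record shape, where the id-to-name mapping is an assumption for illustration (check the dataset card for the authoritative label order):

```python
# Hypothetical id-to-name mapping for the emotion dataset's class labels.
ID_TO_EMOTION = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# A record pairs the raw tweet text with its integer label.
record = {"text": "on a boat trip to denmark", "label": 1}
emotion = ID_TO_EMOTION[record["label"]]
print(emotion)
```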
+
+To get a better understanding of the dataset, here are some examples from the train split:
+
+| Text | Type of emotion |
+| :--- | :----: |
+| im grabbing a minute to post i feel greedy wrong | Anger |
+| i am ever feeling nostalgic about the fireplace i will know that it is still on the property | Love |
+| ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny | Fear |
+| on a boat trip to denmark | Joy |
+| i feel you know basically like a fake in the realm of science fiction | Sadness |
+| i began having them several times a week feeling tortured by the hallucinations moving people and figures sounds and vibrations | Fear |
+
+### Anomaly Detection Algorithm
+[HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) is a clustering algorithm that extends DBSCAN by converting it into a hierarchical clustering algorithm and then extracting a flat clustering based on the stability of clusters. Once trained, the model predicts -1 if a new data point is an outlier; otherwise it predicts one of the existing clusters.
+
+## Ingestion to PubSub
+We first ingest the data into [PubSub](https://cloud.google.com/pubsub/docs/overview) so that while clustering we can read the tweets from PubSub. PubSub is a messaging service for exchanging event data among applications and services. It is used in streaming analytics and data integration pipelines to ingest and distribute data.
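Before publishing, each tweet needs a unique identifier and a byte-encoded payload. A minimal standard-library sketch of that formatting step (the helper name and the `id` attribute key are illustrative, not taken from the example code):

```python
import uuid

def to_pubsub_payload(text: str) -> dict:
    # Attach a unique identifier (UID) and UTF-8 encode the text, matching
    # the data/attributes shape that Pub/Sub messages carry.
    return {"data": text.encode("utf-8"), "attributes": {"id": uuid.uuid4().hex}}

msg = to_pubsub_payload("im grabbing a minute to post i feel greedy wrong")
```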
+
+The full example code for ingesting data into PubSub can be found [here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/inference/anomaly_detection/write_data_to_pubsub_pipeline/).
+
+The file structure for the ingestion pipeline is:
+
+    write_data_to_pubsub_pipeline/
+    ├── pipeline/
+    │   ├── __init__.py
+    │   ├── options.py
+    │   └── utils.py
+    ├── __init__.py
+    ├── config.py
+    ├── main.py
+    └── setup.py
+
+`pipeline/utils.py` contains the code for loading the emotion dataset and two `beam.DoFn`s that are used for data transformation.
+
+`pipeline/options.py` contains the pipeline options to configure the Dataflow pipeline.
+
+`config.py` defines variables like the GCP PROJECT_ID and NUM_WORKERS that are used multiple times.
+
+`setup.py` defines the packages/requirements for the pipeline to run.
+
+`main.py` contains the pipeline code and some additional function used for running the pipeline
+
+### How to Run the Pipeline?
+First, make sure you have installed the required packages.
+
+1. Locally on your machine: `python main.py`
+2. On GCP for Dataflow: `python main.py --mode cloud`
+
+The `write_data_to_pubsub_pipeline` contains four different transforms:
+1. Load the emotion dataset using HuggingFace Datasets (for simplicity, we take samples from 3 of the 6 classes)
+2. Associate each piece of text with a unique identifier (UID)
+3. Convert the text into the format PubSub expects
+4. Write the formatted message to PubSub
+
+## Anomaly Detection on Streaming Data
+
+With the data ingested into PubSub, we can run the anomaly detection pipeline. This pipeline reads the streaming messages from PubSub, converts each text to an embedding using a language model, and feeds the embedding to an already trained clustering model to predict whether the message is an anomaly. One prerequisite for this pipeline is to have an HDBSCAN clustering model trained on the training split of the dataset.
+
+The full example code for anomaly detection can be found [here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/inference/anomaly_detection/anomaly_detection_pipeline/).
+
+The file structure for the anomaly_detection pipeline is:
+
+    anomaly_detection_pipeline/
+    ├── pipeline/
+    │   ├── __init__.py
+    │   ├── options.py
+    │   └── transformations.py
+    ├── __init__.py
+    ├── config.py
+    ├── main.py
+    └── setup.py
+
+`pipeline/transformations.py` contains the code for the different `beam.DoFn`s and additional functions that are used in the pipeline.
+
+`pipeline/options.py` contains the pipeline options to configure the Dataflow pipeline.
+
+`config.py` defines variables like the GCP PROJECT_ID and NUM_WORKERS that are used multiple times.
+
+`setup.py` defines the packages/requirements for the pipeline to run.
+
+`main.py` contains the pipeline code and some additional functions used for running the pipeline.
+
+### How to Run the Pipeline?
+First, make sure you have installed the required packages and have pushed data to PubSub.
+
+1. Locally on your machine: `python main.py`
+2. On GCP for Dataflow: `python main.py --mode cloud`
+
+The pipeline can be broken down into a few simple steps:
+
+1. Reading the message from PubSub
+2. Converting the PubSub message into a PCollection of dictionaries where the key is the UID and the value is the Twitter text
+3. Encoding the text into transformer-readable token ID integers using a tokenizer
+4. Using RunInference to get the vector embedding from a transformer-based language model
+5. Normalizing the embedding
+6. Using RunInference to get the anomaly prediction from a trained HDBSCAN clustering model
+7. Writing the prediction to BigQuery so that the clustering model can be retrained when needed
+8. 
Sending an email alert if an anomaly is detected
+
+The code snippet for the first two steps of the pipeline:
+
+{{< highlight >}}
+  docs = (
+      pipeline
+      | "Read from PubSub"
+      >> ReadFromPubSub(subscription=cfg.SUBSCRIPTION_ID, with_attributes=True)
+      | "Decode PubSubMessage" >> beam.ParDo(Decode())
+  )
+{{< /highlight >}}
+
+We will now focus on the important steps of the pipeline: tokenizing the text, getting the embedding using RunInference, and finally getting the prediction from the HDBSCAN model.
+
+### Getting Embedding from a Language Model
+
+In order to do clustering with text data, we first need to map the text into vectors of numerical values suitable for statistical analysis. We use a transformer-based language model called [sentence-transformers/stsb-distilbert-base](https://huggingface.co/sentence-transformers/stsb-distilbert-base). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. But we first need to tokenize the text, as the language model expects a tokenized input instead of raw text.
+
+Tokenization is a preprocessing task that transforms text in a way that it can be fed into the model for getting predictions.

Review Comment:
```suggestion
Tokenization is a preprocessing task that transforms text so that it can be fed into the model for getting predictions.
```
Sorry, one more wording nit I missed last time

########## website/www/site/content/en/documentation/ml/anomaly-detection.md: ##########
@@ -0,0 +1,238 @@
+---
+title: "Anomaly Detection"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
...
+`main.py` contains the pipeline code and some additional function used for running the pipeline

Review Comment:
```suggestion
`main.py` contains the pipeline code and some additional functions used for running the pipeline
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
