damccorm commented on code in PR #25081:
URL: https://github.com/apache/beam/pull/25081#discussion_r1082980208
##########
website/www/site/content/en/documentation/ml/per-entity-training.md:
##########
@@ -0,0 +1,64 @@
+---
+title: "Per Entity Training"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Per Entity Training
+The aim of this pipeline example is to demonstrate per entity training in Beam. Per entity training refers to the process of training a machine learning model for each individual entity, rather than training a single model for all entities. In this approach, a separate model is trained for each entity based on the data specific to that entity. Per entity training can be beneficial in scenarios:
+
+* Having separate models allows for more personalized and tailored predictions for each group. This is because each group may have different characteristics, patterns, and behaviors that a single large model may not be able to capture effectively.
+
+* Having separate models can also help to reduce the complexity of the overall model and make it more efficient. This is because the overall model would only need to focus on the specific characteristics and patterns of the individual group, rather than trying to account for all possible characteristics and patterns across all groups.
+
+* It can also address the issue of bias and fairness, as a single model trained on a diverse dataset may not generalize well to certain groups, separate models for each group can reduce the impact of bias.
+
+* This approach is often favored in production settings as it allows for the detection of issues specific to a limited segment of the overall population with greater ease.
+
+* When working with smaller models and datasets, the process of retraining can be completed more rapidly and efficiently. Additionally, the ability to parallelize the process becomes more feasible when dealing with large amounts of data. Furthermore, smaller models and datasets also have the advantage of being less resource-intensive, which allows them to be run on less expensive hardware.
+
+## Dataset
+This example uses [Adult Census Income dataset](https://archive.ics.uci.edu/ml/datasets/adult). The dataset contains information about individuals, including their demographic characteristics, employment status, and income level. The dataset includes both categorical and numerical features, such as age, education, occupation, and hours worked per week, as well as a binary label indicating whether an individual's income is above or below 50K. The primary goal of this dataset is to be used for classification tasks, where the model will predict whether an individual's income is above or below a certain threshold based on the provided features.
+
+### Run the Pipeline ?
+First, install the required packages `apache-beam==2.44.0`, `scikit-learn==1.0.2` and `pandas==1.3.5`.
+You can view the code on [GitHub](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/per_entity_training.py).
+Use `python per_entity_training.py --input path_to_data`

Review Comment:
   `path_to_data` - could you provide more specific info on the expected data/format? Is this supposed to be `adult.data` from https://archive.ics.uci.edu/ml/machine-learning-databases/adult/
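For readers following this thread, the rough shape of the pipeline the page describes is sketched below. This is a hedged illustration, assuming the input really is the comma-separated `adult.data` file from the UCI archive (the open question above); the column indices, feature selection, and helper names (`key_by_education`, `train_one_model`) are assumptions of this sketch, not the code in the linked `per_entity_training.py`.

```python
# Minimal, illustrative sketch of the per-entity training pattern -- NOT the
# actual per_entity_training.py example linked above. Assumes --input points
# at the comma-separated UCI `adult.data` file.
import argparse

import apache_beam as beam
from sklearn.tree import DecisionTreeClassifier

# Column positions in adult.data: age is column 0, education column 3,
# hours-per-week column 12, and the income label (<=50K / >50K) column 14.
AGE_IDX, EDUCATION_IDX, HOURS_IDX, LABEL_IDX = 0, 3, 12, 14


def key_by_education(line):
    """Parse one CSV row and key it by its education level (the entity)."""
    fields = [field.strip() for field in line.split(',')]
    features = [float(fields[AGE_IDX]), float(fields[HOURS_IDX])]
    label = 1 if fields[LABEL_IDX].startswith('>50K') else 0
    return fields[EDUCATION_IDX], (features, label)


def train_one_model(element):
    """Fit a separate DecisionTreeClassifier on the rows of one entity."""
    education, rows = element
    rows = list(rows)
    x = [features for features, _ in rows]
    y = [label for _, label in rows]
    return education, DecisionTreeClassifier(max_depth=5).fit(x, y)


def run(input_path):
    with beam.Pipeline() as pipeline:
        _ = (
            pipeline
            | 'ReadRows' >> beam.io.ReadFromText(input_path)
            | 'DropEmptyLines' >> beam.Filter(lambda line: line.strip())
            | 'KeyByEntity' >> beam.Map(key_by_education)
            | 'GroupPerEntity' >> beam.GroupByKey()
            | 'TrainPerEntity' >> beam.Map(train_one_model)
            | 'PrintResults' >> beam.Map(print))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', required=True, help='Path to adult.data')
    run(parser.parse_args().input)
```

The essential step is the `GroupByKey` on the entity key (education level here): each group's rows are collected together so a separate `DecisionTreeClassifier` can be fit per group, independently and in parallel.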
##########
website/www/site/content/en/documentation/ml/per-entity-training.md:
##########
@@ -0,0 +1,64 @@
+---
+title: "Per Entity Training"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Per Entity Training
+The aim of this pipeline example is to demonstrate per entity training in Beam. Per entity training refers to the process of training a machine learning model for each individual entity, rather than training a single model for all entities. In this approach, a separate model is trained for each entity based on the data specific to that entity. Per entity training can be beneficial in scenarios:
+
+* Having separate models allows for more personalized and tailored predictions for each group. This is because each group may have different characteristics, patterns, and behaviors that a single large model may not be able to capture effectively.
+
+* Having separate models can also help to reduce the complexity of the overall model and make it more efficient. This is because the overall model would only need to focus on the specific characteristics and patterns of the individual group, rather than trying to account for all possible characteristics and patterns across all groups.
+
+* It can also address the issue of bias and fairness, as a single model trained on a diverse dataset may not generalize well to certain groups, separate models for each group can reduce the impact of bias.
+
+* This approach is often favored in production settings as it allows for the detection of issues specific to a limited segment of the overall population with greater ease.
+
+* When working with smaller models and datasets, the process of retraining can be completed more rapidly and efficiently. Additionally, the ability to parallelize the process becomes more feasible when dealing with large amounts of data. Furthermore, smaller models and datasets also have the advantage of being less resource-intensive, which allows them to be run on less expensive hardware.

Review Comment:
   ```suggestion
   * When working with smaller models and datasets, the process of training and retraining can be completed more rapidly and efficiently. Both training and retraining can be done in parallel, reducing the amount of time spent waiting for results. Furthermore, smaller models and datasets also have the advantage of being less resource-intensive, which allows them to be run on less expensive hardware.
   ```
##########
website/www/site/content/en/documentation/ml/overview.md:
##########
@@ -90,4 +90,5 @@ You can find examples of end-to-end AI/ML pipelines for several use cases:
 * [Multi model pipelines in Beam](/documentation/ml/multi-model-pipelines): Explains how multi-model pipelines work and gives an overview of what you need to know to build one using the RunInference API.
 * [Online Clustering in Beam](/documentation/ml/online-clustering): Demonstrates how to set up a real-time clustering pipeline that can read text from Pub/Sub, convert the text into an embedding using a transformer-based language model with the RunInference API, and cluster the text using BIRCH with stateful processing.
 * [Anomaly Detection in Beam](/documentation/ml/anomaly-detection): Demonstrates how to set up an anomaly detection pipeline that reads text from Pub/Sub in real time and then detects anomalies using a trained HDBSCAN clustering model with the RunInference API.
-* [Large Language Model Inference in Beam](/documentation/ml/large-language-modeling): Demonstrates a pipeline that uses RunInference to perform translation with the T5 language model which contains 11 billion parameters.
\ No newline at end of file
+* [Large Language Model Inference in Beam](/documentation/ml/large-language-modeling): Demonstrates a pipeline that uses RunInference to perform translation with the T5 language model which contains 11 billion parameters.
+* [Per Entity Training in Beam](/documentation/ml/per-entity-training): Demonstrates a pipeline that trains Decision Tree Classifier per education level for predicting if salary of a person is >= 50k.

Review Comment:
   ```suggestion
   * [Per Entity Training in Beam](/documentation/ml/per-entity-training): Demonstrates a pipeline that trains a Decision Tree Classifier per education level for predicting if the salary of a person is >= 50k.
   ```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
