[GitHub] [beam] rszper commented on a diff in pull request #25081: Add working example for Per Entity Training

via GitHub Fri, 20 Jan 2023 17:10:05 -0800


rszper commented on code in PR #25081:
URL: https://github.com/apache/beam/pull/25081#discussion_r1082992759



##########
website/www/site/content/en/documentation/ml/per-entity-training.md:
##########
@@ -0,0 +1,64 @@
+---
+title: "Per Entity Training"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Per Entity Training
+The aim of this pipeline example is to demonstrate per entity training in 
Beam. Per entity training refers to the process of training a machine learning 
model for each individual entity, rather than training a single model for all 
entities. In this approach, a separate model is trained for each entity based 
on the data specific to that entity. Per entity training can be beneficial in 
scenarios:
+
+* Having separate models allows for more personalized and tailored predictions 
for each group. This is because each group may have different characteristics, 
patterns, and behaviors that a single large model may not be able to capture 
effectively.

Review Comment:
   For this point and the next, remove "This is because" from the start of the 
second sentence.



##########
website/www/site/content/en/documentation/ml/per-entity-training.md:
##########
@@ -0,0 +1,64 @@
+---
+title: "Per Entity Training"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Per Entity Training
+The aim of this pipeline example is to demonstrate per entity training in 
Beam. Per entity training refers to the process of training a machine learning 
model for each individual entity, rather than training a single model for all 
entities. In this approach, a separate model is trained for each entity based 
on the data specific to that entity. Per entity training can be beneficial in 
scenarios:
+
+* Having separate models allows for more personalized and tailored predictions 
for each group. This is because each group may have different characteristics, 
patterns, and behaviors that a single large model may not be able to capture 
effectively.
+
+* Having separate models can also help to reduce the complexity of the overall 
model and make it more efficient. This is because the overall model would only 
need to focus on the specific characteristics and patterns of the individual 
group, rather than trying to account for all possible characteristics and 
patterns across all groups.
+
+* It can also address the issue of bias and fairness, as a single model 
trained on a diverse dataset may not generalize well to certain groups, 
separate models for each group can reduce the impact of bias.

Review Comment:
   Change to:
   
   Having separate models can address issues of bias and fairness. Because a 
single model trained on a diverse dataset might not generalize well to certain 
groups, separate models for each group can reduce the impact of bias.



##########
website/www/site/content/en/documentation/ml/per-entity-training.md:
##########
@@ -0,0 +1,64 @@
+---
+title: "Per Entity Training"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Per Entity Training
+The aim of this pipeline example is to demonstrate per entity training in 
Beam. Per entity training refers to the process of training a machine learning 
model for each individual entity, rather than training a single model for all 
entities. In this approach, a separate model is trained for each entity based 
on the data specific to that entity. Per entity training can be beneficial in 
scenarios:
+
+* Having separate models allows for more personalized and tailored predictions 
for each group. This is because each group may have different characteristics, 
patterns, and behaviors that a single large model may not be able to capture 
effectively.
+
+* Having separate models can also help to reduce the complexity of the overall 
model and make it more efficient. This is because the overall model would only 
need to focus on the specific characteristics and patterns of the individual 
group, rather than trying to account for all possible characteristics and 
patterns across all groups.
+
+* It can also address the issue of bias and fairness, as a single model 
trained on a diverse dataset may not generalize well to certain groups, 
separate models for each group can reduce the impact of bias.
+
+* This approach is often favored in production settings as it allows for the 
detection of issues specific to a limited segment of the overall population 
with greater ease.

Review Comment:
   Change to:
   
   This approach is often favored in production settings, because it makes it 
easier to detect issues specific to a limited segment of the overall population.



##########
website/www/site/content/en/documentation/ml/per-entity-training.md:
##########
@@ -0,0 +1,64 @@
+---
+title: "Per Entity Training"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Per Entity Training
+The aim of this pipeline example is to demonstrate per entity training in 
Beam. Per entity training refers to the process of training a machine learning 
model for each individual entity, rather than training a single model for all 
entities. In this approach, a separate model is trained for each entity based 
on the data specific to that entity. Per entity training can be beneficial in 
scenarios:
+
+* Having separate models allows for more personalized and tailored predictions 
for each group. This is because each group may have different characteristics, 
patterns, and behaviors that a single large model may not be able to capture 
effectively.
+
+* Having separate models can also help to reduce the complexity of the overall 
model and make it more efficient. This is because the overall model would only 
need to focus on the specific characteristics and patterns of the individual 
group, rather than trying to account for all possible characteristics and 
patterns across all groups.
+
+* It can also address the issue of bias and fairness, as a single model 
trained on a diverse dataset may not generalize well to certain groups, 
separate models for each group can reduce the impact of bias.
+
+* This approach is often favored in production settings as it allows for the 
detection of issues specific to a limited segment of the overall population 
with greater ease.
+
+* When working with smaller models and datasets, the process of retraining can 
be completed more rapidly and efficiently. Additionally, the ability to 
parallelize the process becomes more feasible when dealing with large amounts 
of data. Furthermore, smaller models and datasets also have the advantage of 
being less resource-intensive, which allows them to be run on less expensive 
hardware.
+
+## Dataset
+This example uses [Adult Census Income 
dataset](https://archive.ics.uci.edu/ml/datasets/adult). The dataset contains 
information about individuals, including their demographic characteristics, 
employment status, and income level. The dataset includes both categorical and 
numerical features, such as age, education, occupation, and hours worked per 
week, as well as a binary label indicating whether an individual's income is 
above or below 50K. The primary goal of this dataset is to be used for 
classification tasks, where the model will predict whether an individual's 
income is above or below a certain threshold based on the provided features.

Review Comment:
   Instead of 50K, we should say 50,000 USD.



##########
website/www/site/content/en/documentation/ml/per-entity-training.md:
##########
@@ -0,0 +1,64 @@
+---
+title: "Per Entity Training"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Per Entity Training
+The aim of this pipeline example is to demonstrate per entity training in 
Beam. Per entity training refers to the process of training a machine learning 
model for each individual entity, rather than training a single model for all 
entities. In this approach, a separate model is trained for each entity based 
on the data specific to that entity. Per entity training can be beneficial in 
scenarios:
+
+* Having separate models allows for more personalized and tailored predictions 
for each group. This is because each group may have different characteristics, 
patterns, and behaviors that a single large model may not be able to capture 
effectively.
+
+* Having separate models can also help to reduce the complexity of the overall 
model and make it more efficient. This is because the overall model would only 
need to focus on the specific characteristics and patterns of the individual 
group, rather than trying to account for all possible characteristics and 
patterns across all groups.
+
+* It can also address the issue of bias and fairness, as a single model 
trained on a diverse dataset may not generalize well to certain groups, 
separate models for each group can reduce the impact of bias.
+
+* This approach is often favored in production settings as it allows for the 
detection of issues specific to a limited segment of the overall population 
with greater ease.
+
+* When working with smaller models and datasets, the process of retraining can 
be completed more rapidly and efficiently. Additionally, the ability to 
parallelize the process becomes more feasible when dealing with large amounts 
of data. Furthermore, smaller models and datasets also have the advantage of 
being less resource-intensive, which allows them to be run on less expensive 
hardware.
+
+## Dataset
+This example uses [Adult Census Income 
dataset](https://archive.ics.uci.edu/ml/datasets/adult). The dataset contains 
information about individuals, including their demographic characteristics, 
employment status, and income level. The dataset includes both categorical and 
numerical features, such as age, education, occupation, and hours worked per 
week, as well as a binary label indicating whether an individual's income is 
above or below 50K. The primary goal of this dataset is to be used for 
classification tasks, where the model will predict whether an individual's 
income is above or below a certain threshold based on the provided features.
+
+### Run the Pipeline ?
+First, install the required packages `apache-beam==2.44.0`, 
`scikit-learn==1.0.2` and `pandas==1.3.5`.
+You can view the code on 
[GitHub](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/per_entity_training.py).
+Use `python per_entity_training.py --input path_to_data`
+
+
+### Training pipeline

Review Comment:
   Change heading to: Train the pipeline



##########
website/www/site/content/en/documentation/ml/per-entity-training.md:
##########
@@ -0,0 +1,64 @@
+---
+title: "Per Entity Training"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Per Entity Training
+The aim of this pipeline example is to demonstrate per entity training in 
Beam. Per entity training refers to the process of training a machine learning 
model for each individual entity, rather than training a single model for all 
entities. In this approach, a separate model is trained for each entity based 
on the data specific to that entity. Per entity training can be beneficial in 
scenarios:
+
+* Having separate models allows for more personalized and tailored predictions 
for each group. This is because each group may have different characteristics, 
patterns, and behaviors that a single large model may not be able to capture 
effectively.
+
+* Having separate models can also help to reduce the complexity of the overall 
model and make it more efficient. This is because the overall model would only 
need to focus on the specific characteristics and patterns of the individual 
group, rather than trying to account for all possible characteristics and 
patterns across all groups.
+
+* It can also address the issue of bias and fairness, as a single model 
trained on a diverse dataset may not generalize well to certain groups, 
separate models for each group can reduce the impact of bias.
+
+* This approach is often favored in production settings as it allows for the 
detection of issues specific to a limited segment of the overall population 
with greater ease.
+
+* When working with smaller models and datasets, the process of retraining can 
be completed more rapidly and efficiently. Additionally, the ability to 
parallelize the process becomes more feasible when dealing with large amounts 
of data. Furthermore, smaller models and datasets also have the advantage of 
being less resource-intensive, which allows them to be run on less expensive 
hardware.
+
+## Dataset
+This example uses [Adult Census Income 
dataset](https://archive.ics.uci.edu/ml/datasets/adult). The dataset contains 
information about individuals, including their demographic characteristics, 
employment status, and income level. The dataset includes both categorical and 
numerical features, such as age, education, occupation, and hours worked per 
week, as well as a binary label indicating whether an individual's income is 
above or below 50K. The primary goal of this dataset is to be used for 
classification tasks, where the model will predict whether an individual's 
income is above or below a certain threshold based on the provided features.
+
+### Run the Pipeline ?
+First, install the required packages `apache-beam==2.44.0`, 
`scikit-learn==1.0.2` and `pandas==1.3.5`.
+You can view the code on 
[GitHub](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/per_entity_training.py).
+Use `python per_entity_training.py --input path_to_data`
+
+
+### Training pipeline
+The pipeline can be broken down into the following main steps:
+1. Reading the data from the provided input path.

Review Comment:
   Use the imperative form instead of gerunds for these steps:
   
   Reading -> Read
   Filtering -> Filter
   Creating -> Create
   Grouping -> Group
   Preprocessing -> Preprocess
   Training -> Train
   Saving -> Save
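
To illustrate how the seven steps fit together once they are phrased as actions, here is a minimal plain-Python sketch. It is not the code from this PR and it does not use Beam. The sample rows, the choice of `education` as the entity key, the "drop rows containing `?`" filter, and the majority-label "model" are all assumptions made for illustration. In a Beam pipeline, the Filter and Group steps would typically map to `beam.Filter` and `beam.GroupByKey`.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample rows in the spirit of the Adult Census Income data.
SAMPLE_CSV = """\
age,education,hours_per_week,income
39,Bachelors,40,<=50K
50,Bachelors,13,>50K
52,Bachelors,45,>50K
38,HS-grad,40,<=50K
53,HS-grad,40,<=50K
28,HS-grad,40,>50K
30,?,40,<=50K
"""


def read_rows(text):
    """Read: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def has_no_missing_values(row):
    """Filter: keep rows without empty or '?' values (assumed criterion)."""
    return all(value.strip() not in ("", "?") for value in row.values())


def create_key(row):
    """Create: pair each row with its entity key (assumed to be education)."""
    return row["education"], row


def group_by_key(keyed_rows):
    """Group: collect all rows that share the same key."""
    grouped = defaultdict(list)
    for key, row in keyed_rows:
        grouped[key].append(row)
    return dict(grouped)


def preprocess(rows):
    """Preprocess: turn each row into a (numeric features, label) pair."""
    return [((int(r["age"]), int(r["hours_per_week"])), r["income"]) for r in rows]


def train(examples):
    """Train: a stand-in 'model' that always predicts the majority label."""
    labels = [label for _, label in examples]
    return Counter(labels).most_common(1)[0][0]


def run(text):
    rows = [r for r in read_rows(text) if has_no_missing_values(r)]
    grouped = group_by_key(create_key(r) for r in rows)
    # Save: an in-memory dict stands in for writing one model file per entity.
    return {entity: train(preprocess(group)) for entity, group in grouped.items()}


models = run(SAMPLE_CSV)
print(models)  # {'Bachelors': '>50K', 'HS-grad': '<=50K'}
```

Each function name starts with the verb used in the corresponding step, which is one reason the imperative form reads more naturally in the numbered list.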



##########
website/www/site/content/en/documentation/ml/per-entity-training.md:
##########
@@ -0,0 +1,64 @@
+---
+title: "Per Entity Training"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Per Entity Training
+The aim of this pipeline example is to demonstrate per entity training in 
Beam. Per entity training refers to the process of training a machine learning 
model for each individual entity, rather than training a single model for all 
entities. In this approach, a separate model is trained for each entity based 
on the data specific to that entity. Per entity training can be beneficial in 
scenarios:
+
+* Having separate models allows for more personalized and tailored predictions 
for each group. This is because each group may have different characteristics, 
patterns, and behaviors that a single large model may not be able to capture 
effectively.
+
+* Having separate models can also help to reduce the complexity of the overall 
model and make it more efficient. This is because the overall model would only 
need to focus on the specific characteristics and patterns of the individual 
group, rather than trying to account for all possible characteristics and 
patterns across all groups.
+
+* It can also address the issue of bias and fairness, as a single model 
trained on a diverse dataset may not generalize well to certain groups, 
separate models for each group can reduce the impact of bias.
+
+* This approach is often favored in production settings as it allows for the 
detection of issues specific to a limited segment of the overall population 
with greater ease.
+
+* When working with smaller models and datasets, the process of retraining can 
be completed more rapidly and efficiently. Additionally, the ability to 
parallelize the process becomes more feasible when dealing with large amounts 
of data. Furthermore, smaller models and datasets also have the advantage of 
being less resource-intensive, which allows them to be run on less expensive 
hardware.
+
+## Dataset
+This example uses [Adult Census Income 
dataset](https://archive.ics.uci.edu/ml/datasets/adult). The dataset contains 
information about individuals, including their demographic characteristics, 
employment status, and income level. The dataset includes both categorical and 
numerical features, such as age, education, occupation, and hours worked per 
week, as well as a binary label indicating whether an individual's income is 
above or below 50K. The primary goal of this dataset is to be used for 
classification tasks, where the model will predict whether an individual's 
income is above or below a certain threshold based on the provided features.
+
+### Run the Pipeline ?

Review Comment:
   Remove the question mark from the heading



##########
website/www/site/content/en/documentation/ml/per-entity-training.md:
##########
@@ -0,0 +1,64 @@
+---
+title: "Per Entity Training"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Per Entity Training
+The aim of this pipeline example is to demonstrate per entity training in 
Beam. Per entity training refers to the process of training a machine learning 
model for each individual entity, rather than training a single model for all 
entities. In this approach, a separate model is trained for each entity based 
on the data specific to that entity. Per entity training can be beneficial in 
scenarios:

Review Comment:
   Change "in scenarios:" to "in the following scenarios:"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
