damccorm commented on code in PR #24965: URL: https://github.com/apache/beam/pull/24965#discussion_r1091071810
##########
sdks/python/apache_beam/examples/inference/README.md:
##########
@@ -374,3 +374,59 @@ True Price 31000000.0, Predicted Price 25654277.256461
 ...
 ```
+## Iris Classification
+
+[`xgboost_iris_classification.py.py`](./xgboost_iris_classification.py.py) contains an implementation for a RunInference pipeline that performs classification on tabular data from the [Iris Dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).
+
+The pipeline reads rows that contain the features of a given iris. The features are Sepal Length, Sepal Width, Petal Length and Petal Width. The pipeline passes those features to the XGBoost implementation of RunInference which writes the iris type predictions to a text file.
+
+### Dataset and model for language modeling
+
+To use this transform, you need to have sklearn installed. The dataset is loaded from using sklearn. The `_train_model` function can be used to train a simple classifier. The function outputs it's configuration in a file that can be loaded by the `XGBoostModelHandler`.
+
+### Training a simple classifier
+
+The following function you to train a simple classifier using the sklearn Iris dataset. The trained model will be saved in the location passed as a parameter and can then later be loaded in an pipeline using the `XGBoostModelHandler`.
+
+```
+def _train_model(model_state_output_path: str = '/tmp/model.json', seed=999):
+  """Function to train an XGBoost Classifier using the sklearn Iris dataset"""
+  dataset = load_iris()
+  x_train, _, y_train, _ = train_test_split(
+      dataset['data'], dataset['target'], test_size=.2, random_state=seed)
+  booster = xgboost.XGBClassifier(
+      n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')
+  booster.fit(x_train, y_train)
+  booster.save_model(model_state_output_path)
+  return booster
+```
+
+#### Running the Pipeline
+To run locally, use the following command:
+
+```
+python -m apache_beam.examples.inference.xgboost_iris_classification.py \
+  --input_type INPUT_TYPE \
+  --output OUTPUT_FILE \
+  -- model_state MODEL_STATE_JSON \
+  [--no_split|--split]
+```
+
+For example:
+
+```
+python -m apache_beam.examples.inference.xgboost_iris_classification.py \
+  --input_type numpy \
+  --output predictions.txt \
+  --model_state model_state.json \
+  --split
+```
+
+This writes the output to the `predictions.txt` with contents like:
+```
+0,[1]
+1,[2]
+2,[1]
+3,[0]
+...
+```

Review Comment:
   Could you please add a quick description of what these predictions map to? (I think they are row number + iris class?)

##########
sdks/python/apache_beam/examples/inference/README.md:
##########
@@ -374,3 +374,59 @@ True Price 31000000.0, Predicted Price 25654277.256461
 ...
 ```
+## Iris Classification
+
+[`xgboost_iris_classification.py.py`](./xgboost_iris_classification.py.py) contains an implementation for a RunInference pipeline that performs classification on tabular data from the [Iris Dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).
+
+The pipeline reads rows that contain the features of a given iris. The features are Sepal Length, Sepal Width, Petal Length and Petal Width. The pipeline passes those features to the XGBoost implementation of RunInference which writes the iris type predictions to a text file.
+
+### Dataset and model for language modeling
+
+To use this transform, you need to have sklearn installed. The dataset is loaded from using sklearn. The `_train_model` function can be used to train a simple classifier. The function outputs it's configuration in a file that can be loaded by the `XGBoostModelHandler`.
+
+### Training a simple classifier
+
+The following function you to train a simple classifier using the sklearn Iris dataset. The trained model will be saved in the location passed as a parameter and can then later be loaded in an pipeline using the `XGBoostModelHandler`.

Review Comment:
   ```suggestion
   The following function allows you to train a simple classifier using the sklearn Iris dataset. The trained model will be saved in the location passed as a parameter and can then later be loaded in an pipeline using the `XGBoostModelHandler`.
   ```

##########
sdks/python/tox.ini:
##########
@@ -326,3 +326,16 @@ commands =
 # Run all PyTorch unit tests
 # Allow exit code 5 (no tests run) so that we can run this command safely on arbitrary subdirectories.
 /bin/sh -c 'pytest -o junit_suite_name={envname} --junitxml=pytest_{envname}.xml -n 6 -m uses_pytorch {posargs}; ret=$?; [ $ret = 5 ] && exit 0 || exit $ret'
+
+[testenv:py{37,38,39,310}-xgboost-{160,170}]
+deps =
+  -r build-requirements.txt
+  160: torch>=1.6.0,<1.7.0
+  170: torch>=1.7.0

Review Comment:
   Instead of torch, this should be installing xgboost, right? I think we also need to be installing datatable here, right?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
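Editor's note on the first review comment: the prediction format the reviewer asks about can be illustrated with a small, hypothetical sketch. This assumes each output line is `row_index,[predicted_class]` (as the sample output in the README suggests) and that class ids follow sklearn's `load_iris` target ordering (0 = setosa, 1 = versicolor, 2 = virginica); the helper name is invented for illustration, not part of the example pipeline.

```python
# Assumed class-id-to-name mapping, following sklearn's load_iris target order.
IRIS_NAMES = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}


def parse_prediction_line(line: str):
    """Parse one assumed output line "row_index,[class_id]" into a tuple.

    Returns (row_index, class_id, class_name).
    """
    index_part, class_part = line.strip().split(',', 1)
    class_id = int(class_part.strip('[]'))
    return int(index_part), class_id, IRIS_NAMES[class_id]


# The sample lines from the README's predictions.txt excerpt:
for row in ['0,[1]', '1,[2]', '2,[1]', '3,[0]']:
    print(parse_prediction_line(row))
```

Running this over the sample lines prints tuples such as `(0, 1, 'versicolor')`, which is the "row number + iris class" reading the reviewer proposes.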
