damccorm commented on code in PR #24965: URL: https://github.com/apache/beam/pull/24965#discussion_r1091071810
##########
sdks/python/apache_beam/examples/inference/README.md:
##########
@@ -374,3 +374,59 @@ True Price 31000000.0, Predicted Price 25654277.256461
 ...
 ```
+## Iris Classification
+
+[`xgboost_iris_classification.py.py`](./xgboost_iris_classification.py.py) contains an implementation for a RunInference pipeline that performs classification on tabular data from the [Iris Dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).
+
+The pipeline reads rows that contain the features of a given iris. The features are Sepal Length, Sepal Width, Petal Length and Petal Width. The pipeline passes those features to the XGBoost implementation of RunInference which writes the iris type predictions to a text file.
+
+### Dataset and model for language modeling
+
+To use this transform, you need to have sklearn installed. The dataset is loaded from using sklearn. The `_train_model` function can be used to train a simple classifier. The function outputs it's configuration in a file that can be loaded by the `XGBoostModelHandler`.
+
+### Training a simple classifier
+
+The following function you to train a simple classifier using the sklearn Iris dataset. The trained model will be saved in the location passed as a parameter and can then later be loaded in an pipeline using the `XGBoostModelHandler`.
+
+```
+def _train_model(model_state_output_path: str = '/tmp/model.json', seed=999):
+  """Function to train an XGBoost Classifier using the sklearn Iris dataset"""
+  dataset = load_iris()
+  x_train, _, y_train, _ = train_test_split(
+      dataset['data'], dataset['target'], test_size=.2, random_state=seed)
+  booster = xgboost.XGBClassifier(
+      n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')
+  booster.fit(x_train, y_train)
+  booster.save_model(model_state_output_path)
+  return booster
+```
+
+#### Running the Pipeline
+To run locally, use the following command:
+
+```
+python -m apache_beam.examples.inference.xgboost_iris_classification.py \
+  --input_type INPUT_TYPE \
+  --output OUTPUT_FILE \
+  -- model_state MODEL_STATE_JSON \
+  [--no_split|--split]
+```
+
+For example:
+
+```
+python -m apache_beam.examples.inference.xgboost_iris_classification.py \
+  --input_type numpy \
+  --output predictions.txt \
+  --model_state model_state.json \
+  --split
+```
+
+This writes the output to the `predictions.txt` with contents like:
+```
+0,[1]
+1,[2]
+2,[1]
+3,[0]
+...
+```

Review Comment:
   Could you please add a quick description of what these predictions map to? (I think they are row number + iris class?)

##########
sdks/python/apache_beam/examples/inference/README.md:
##########
@@ -374,3 +374,59 @@ True Price 31000000.0, Predicted Price 25654277.256461
 ...
 ```
+## Iris Classification
+
+[`xgboost_iris_classification.py.py`](./xgboost_iris_classification.py.py) contains an implementation for a RunInference pipeline that performs classification on tabular data from the [Iris Dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).
+
+The pipeline reads rows that contain the features of a given iris. The features are Sepal Length, Sepal Width, Petal Length and Petal Width. The pipeline passes those features to the XGBoost implementation of RunInference which writes the iris type predictions to a text file.
+
+### Dataset and model for language modeling
+
+To use this transform, you need to have sklearn installed. The dataset is loaded from using sklearn. The `_train_model` function can be used to train a simple classifier. The function outputs it's configuration in a file that can be loaded by the `XGBoostModelHandler`.
+
+### Training a simple classifier
+
+The following function you to train a simple classifier using the sklearn Iris dataset. The trained model will be saved in the location passed as a parameter and can then later be loaded in an pipeline using the `XGBoostModelHandler`.

Review Comment:
   ```suggestion
   The following function allows you to train a simple classifier using the sklearn Iris dataset. The trained model will be saved in the location passed as a parameter and can then later be loaded in an pipeline using the `XGBoostModelHandler`.
   ```

##########
sdks/python/tox.ini:
##########
@@ -326,3 +326,16 @@ commands =
 # Run all PyTorch unit tests
 # Allow exit code 5 (no tests run) so that we can run this command safely on arbitrary subdirectories.
 /bin/sh -c 'pytest -o junit_suite_name={envname} --junitxml=pytest_{envname}.xml -n 6 -m uses_pytorch {posargs}; ret=$?; [ $ret = 5 ] && exit 0 || exit $ret'
+
+[testenv:py{37,38,39,310}-xgboost-{160,170}]
+deps =
+  -r build-requirements.txt
+  160: torch>=1.6.0,<1.7.0
+  170: torch>=1.7.0

Review Comment:
   Instead of torch, this should be installing xgboost, right? I think we also need to be installing datatable here, right?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
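Editor's note on the first review comment: the prediction format the reviewer asks about can be illustrated with a small, hypothetical sketch. This assumes each output line is `row_index,[predicted_class]` (as the sample output in the README suggests) and that class ids follow sklearn's `load_iris` target ordering (0 = setosa, 1 = versicolor, 2 = virginica); the helper name is invented for illustration, not part of the example pipeline.

```python
# Assumed class-id-to-name mapping, following sklearn's load_iris target order.
IRIS_NAMES = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}


def parse_prediction_line(line: str):
    """Parse one assumed output line "row_index,[class_id]" into a tuple.

    Returns (row_index, class_id, class_name).
    """
    index_part, class_part = line.strip().split(',', 1)
    class_id = int(class_part.strip('[]'))
    return int(index_part), class_id, IRIS_NAMES[class_id]


# The sample lines from the README's predictions.txt excerpt:
for row in ['0,[1]', '1,[2]', '2,[1]', '3,[0]']:
    print(parse_prediction_line(row))
```

Running this over the sample lines prints tuples such as `(0, 1, 'versicolor')`, which is the "row number + iris class" reading the reviewer proposes.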
