james94 commented on a change in pull request #781: URL: https://github.com/apache/nifi-minifi-cpp/pull/781#discussion_r424811407
##########
File path: extensions/pythonprocessors/h2o/h2o3/mojo/ExecuteH2oMojoScoring.py
##########
@@ -0,0 +1,165 @@
+#!/usr/bin/env python
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+    -- after downloading the mojo model from h2o3, the following packages
+       are needed to execute the model to do batch or real-time scoring
+
+    Make all packages available on your machine:
+
+        sudo apt-get -y update
+
+    Install Java to include open source H2O-3 algorithms:
+
+        sudo apt-get -y install openjdk-8-jdk
+
+    Install Datatable and pandas:
+
+        pip install datatable
+        pip install pandas
+
+    Option 1: Install H2O-3 with conda
+
+        conda create -n h2o3-nifi-minifi python=3.6
+        conda activate h2o3-nifi-minifi
+        conda config --append channels conda-forge
+        conda install -y -c h2oai h2o
+
+    Option 2: Install H2O-3 with pip
+
+        pip install requests
+        pip install tabulate
+        pip install "colorama>=0.3.8"
+        pip install future
+        pip uninstall h2o
+        If on Mac OS X, must include --user:
+            pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o --user
+        else:
+            pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
+
+"""
+import h2o
+import codecs
+import pandas as pd
+import datatable as dt
+
+mojo_model = None
+
+def describe(processor):
+    """ describe what this processor does
+    """
+    processor.setDescription("Executes H2O-3's MOJO Model in Python to do batch scoring or \
+        real-time scoring for one or more predicted label(s) on the tabular test data in \
+        the incoming flow file content. If tabular data is one row, then MOJO does real-time \
+        scoring. If tabular data is multiple rows, then MOJO does batch scoring.")
+
+def onInitialize(processor):
+    """ onInitialize is where you can set properties
+        processor.addProperty(name, description, defaultValue, required, el)
+    """
+    processor.addProperty("MOJO Model Filepath", "Add the filepath to the MOJO Model file. For example, \
+        'path/to/mojo-model/GBM_grid__1_AutoML_20200511_075150_model_180.zip'.", "", True, False)
+
+    processor.addProperty("Is First Line Header", "Add True or False for whether first line is header.", \
+        "True", True, False)
+
+    processor.addProperty("Input Schema", "If first line is not header, then you must add Input Schema for \
+        incoming data.If there is more than one column name, write a comma separated list of \
+        column names. Else, you do not need to add an Input Schema.", "", False, False)
+
+    processor.addProperty("Use Output Header", "Add True or False for whether you want to use an output \
+        for your predictions.", "False", False, False)
+
+    processor.addProperty("Output Schema", "To set Output Schema, 'Use Output Header' must be set to 'True' \
+        If you want more descriptive column names for your predictions, then add an Output Schema. If there \
+        is more than one column name, write a comma separated list of column names. Else, H2O-3 will include \
+        them by default", "", False, False)
+
+def onSchedule(context):
+    """ onSchedule is where you load and read properties
+        this function is called 1 time when the processor is scheduled to run
+    """
+    global mojo_model

Review comment:
I use global for mojo_model so that I can access it in both onSchedule() and onTrigger(). In onSchedule(), I declare it global because that function assigns to mojo_model: my intention is to instantiate the mojo_model object once, right at the start, when the processor is scheduled to run. This applies to every processor instance. Then, in onTrigger(), we use mojo_model to make predictions.

For example, one processor instance could instantiate a classification mojo_model while another instantiates a regression mojo_model. We then have one processor making classification predictions while the other makes regression predictions. With the code as currently written, each processor has its own mojo_model object, which is global only within this file.

I am not going for an approach where the same global variable is accessible across all processor instances. The answer in this StackOverflow post looks like a way to do that in Python, though: https://stackoverflow.com/questions/13034496/using-global-variables-between-files

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
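The scoping behavior described in the review comment can be sketched in plain Python. This is only an illustration, not MiNiFi's actual loading mechanism: it assumes each processor instance runs the script in its own module namespace, as the comment describes. The load_processor() helper, the simplified onSchedule(model_path) and onTrigger() signatures, and the string standing in for a loaded MOJO are all hypothetical.

```python
import types

# Hypothetical stand-in for the processor script. The real script would load
# an H2O-3 MOJO in onSchedule(context) and score in onTrigger(context, session).
SCRIPT = '''
mojo_model = None

def onSchedule(model_path):
    # 'global' is required because this function rebinds the module-level name.
    global mojo_model
    mojo_model = "model loaded from " + model_path

def onTrigger():
    # Reading a module-level name needs no 'global' declaration.
    return mojo_model
'''

def load_processor(name):
    """Execute SCRIPT in a fresh module so it gets its own set of globals."""
    module = types.ModuleType(name)
    exec(SCRIPT, module.__dict__)
    return module

# Two "processor instances", each with an independent mojo_model.
classifier = load_processor("classifier")
regressor = load_processor("regressor")

classifier.onSchedule("classification.zip")
regressor.onSchedule("regression.zip")

print(classifier.onTrigger())
print(regressor.onTrigger())
```

Under that assumption, the global statement only affects the namespace of the module that executed it, so scheduling one instance does not overwrite the model held by the other.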