[GitHub] [nifi-minifi-cpp] james94 commented on a change in pull request #781: MINIFICPP-1214: Converts H2O Processors to use ALv2 compliant H20-3 library

GitBox Thu, 14 May 2020 09:33:55 -0700


james94 commented on a change in pull request #781:
URL: https://github.com/apache/nifi-minifi-cpp/pull/781#discussion_r424811407




##########
File path: extensions/pythonprocessors/h2o/h2o3/mojo/ExecuteH2oMojoScoring.py
##########
@@ -0,0 +1,165 @@
+#!/usr/bin/env python
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+    -- after downloading the mojo model from h2o3, the following packages
+       are needed to execute the model to do batch or real-time scoring
+
+    Make all packages available on your machine:
+
+    sudo apt-get -y update
+
+    Install Java to include open source H2O-3 algorithms:
+    
+    sudo apt-get -y install openjdk-8-jdk
+
+    Install Datatable and pandas:
+
+    pip install datatable
+    pip install pandas
+
+    Option 1: Install H2O-3 with conda
+
+    conda create -n h2o3-nifi-minifi python=3.6
+    conda activate h2o3-nifi-minifi
+    conda config --append channels conda-forge
+    conda install -y -c h2oai h2o
+
+    Option 2: Install H2O-3 with pip
+
+    pip install requests
+    pip install tabulate
+    pip install "colorama>=0.3.8"
+    pip install future
+    pip uninstall h2o
+    If on Mac OS X, must include --user:
+        pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o --user
+    else:
+        pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
+
+"""
+import h2o
+import codecs
+import pandas as pd
+import datatable as dt
+
+mojo_model = None
+
+def describe(processor):
+    """ describe what this processor does
+    """
+    processor.setDescription("Executes H2O-3's MOJO Model in Python to do batch scoring or \
+        real-time scoring for one or more predicted label(s) on the tabular test data in \
+        the incoming flow file content. If tabular data is one row, then MOJO does real-time \
+        scoring. If tabular data is multiple rows, then MOJO does batch scoring.")
+
+def onInitialize(processor):
+    """ onInitialize is where you can set properties
+        processor.addProperty(name, description, defaultValue, required, el)
+    """
+    processor.addProperty("MOJO Model Filepath", "Add the filepath to the MOJO Model file. For example, \
+        'path/to/mojo-model/GBM_grid__1_AutoML_20200511_075150_model_180.zip'.", "", True, False)
+
+    processor.addProperty("Is First Line Header", "Add True or False for whether first line is header.", \
+        "True", True, False)
+
+    processor.addProperty("Input Schema", "If first line is not header, then you must add Input Schema for \
+        incoming data.If there is more than one column name, write a comma separated list of \
+        column names. Else, you do not need to add an Input Schema.", "", False, False)
+
+    processor.addProperty("Use Output Header", "Add True or False for whether you want to use an output \
+        for your predictions.", "False", False, False)
+
+    processor.addProperty("Output Schema", "To set Output Schema, 'Use Output Header' must be set to 'True' \
+        If you want more descriptive column names for your predictions, then add an Output Schema. If there \
+        is more than one column name, write a comma separated list of column names. Else, H2O-3 will include \
+        them by default", "", False, False)
+
+def onSchedule(context):
+    """ onSchedule is where you load and read properties
+        this function is called 1 time when the processor is scheduled to run
+    """
+    global mojo_model

Review comment:
       I declare mojo_model as global so that both the onSchedule() function
and the onTrigger() function can access it. The global statement is needed in
onSchedule() because that function assigns to mojo_model: my intention is to
instantiate the mojo_model object once, right at the start, when the processor
is scheduled to run. This applies to every processor instance. onTrigger() then
uses mojo_model to make predictions.
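
   A minimal sketch of that pattern, outside MiNiFi. The load_model() helper,
the dict used as a context, and the simplified onTrigger(rows) signature are
hypothetical stand-ins for illustration; they are not the MiNiFi or H2O-3 API.

```python
# Hypothetical stand-ins: load_model() and the dict-based context are
# illustrative only; they are not the MiNiFi or H2O-3 API.
mojo_model = None  # module-level name shared by onSchedule() and onTrigger()


def load_model(path):
    """Pretend to load a model; returns a callable that 'scores' rows."""
    return lambda rows: [f"{path}:{row}" for row in rows]


def onSchedule(context):
    # 'global' is required here because this function rebinds mojo_model.
    global mojo_model
    mojo_model = load_model(context["MOJO Model Filepath"])


def onTrigger(rows):
    # Reading a module-level name needs no 'global' declaration.
    return mojo_model(rows)


onSchedule({"MOJO Model Filepath": "model.zip"})  # runs once
predictions = onTrigger(["row1", "row2"])         # runs per flow file
print(predictions)
```

   Without the global statement, the assignment in onSchedule() would create a
local variable and onTrigger() would still see None.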
   
   For example, one processor instance could instantiate a classification 
mojo_model while another processor could instantiate a regression mojo_model. 
Then we have one processor that is making classification predictions while the 
other processor is making regression predictions.
   
   As the code is currently written, each processor instance has its own
mojo_model object, which is global only within this file.
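
   A rough illustration of why that works, assuming each processor instance
evaluates the script in its own namespace (this sketch does not verify how
MiNiFi loads scripts). Here exec() into two separate dicts imitates two
instances, and the string "models" are placeholders, not real MOJO objects.

```python
# Assumption: each processor instance evaluates the script in its own
# namespace. exec() into separate dicts imitates that here.
SCRIPT = '''
mojo_model = None

def onSchedule(kind):
    global mojo_model
    mojo_model = kind + " model"   # placeholder for a real MOJO model

def onTrigger():
    return mojo_model
'''

classifier_ns, regressor_ns = {}, {}
exec(SCRIPT, classifier_ns)   # first "processor instance"
exec(SCRIPT, regressor_ns)    # second "processor instance"

classifier_ns["onSchedule"]("classification")
regressor_ns["onSchedule"]("regression")

# Each namespace keeps its own mojo_model; neither overwrites the other.
print(classifier_ns["onTrigger"]())
print(regressor_ns["onTrigger"]())
```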
   
   I am not aiming for an approach in which the same global variable is shared
across all processor instances. If that were needed, the answer in this
StackOverflow post looks like one way to do it in Python:
https://stackoverflow.com/questions/13034496/using-global-variables-between-files




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
