James Medel created MINIFICPP-1199:
--------------------------------------
Summary: Integrates MiNiFi C++ with H2O's Driverless AI To Do ML
Inference on Edge
Key: MINIFICPP-1199
URL: https://issues.apache.org/jira/browse/MINIFICPP-1199
Project: Apache NiFi MiNiFi C++
Issue Type: New Feature
Affects Versions: master
Environment: Ubuntu 18.04 in AWS EC2
MiNiFi C++ 0.7.0
Reporter: James Medel
Fix For: master
*MiNiFi C++ and H2O Driverless AI Integration* via Custom Python Processors:
Integrates MiNiFi C++ with H2O's Driverless AI by using Driverless AI's Python
Scoring Pipeline and MiNiFi's custom Python processors. The processors execute
the Python Scoring Pipeline scorer to do batch scoring and real-time scoring,
predicting one or more labels for the test data in the incoming flow file
content. I would like to contribute my processors to MiNiFi C++ as a new
feature.
*3 custom Python processors* created for MiNiFi:
*H2oPspScoreRealTime* - Executes H2O Driverless AI's Python Scoring Pipeline to
do interactive (real-time) scoring on an individual row or list of test data
within each incoming flow file. Uses H2O's open-source Datatable library to
load the test data into a frame, then converts the frame to a pandas dataframe.
Pandas converts the dataframe rows to a list of lists; since each flow file
passing through this processor should contain only 1 row, the processor
extracts the first list. That list is passed to the Driverless AI Python
scorer.score() function, which returns the prediction as a list of one or more
predicted values. The number of predicted labels is fixed when the user builds
the Python Scoring Pipeline in Driverless AI, so the processor has a property
where the user supplies one or more predicted label names to use as the
predicted header. I create a comma-separated string from the predicted header
and predicted values: the header(s) on one line followed by a newline, and the
value(s) on the next line followed by a newline. The string is written to the
flow file content. Flow file attributes are added for the number of lists
scored and for each predicted label name and its associated score. Finally, the
flow file is transferred to the success relationship.
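The output format described for H2oPspScoreRealTime can be sketched in plain
Python. This is a minimal illustration, not the contributed processor:
StubScorer is a hypothetical stand-in for the Driverless AI scorer, the label
name cool_cond_y, the attribute key num_lists_scored, and the sensor values are
assumptions made for the example, and the Datatable/pandas loading step and the
MiNiFi session calls are left out.

```python
class StubScorer:
    """Hypothetical stand-in for the Driverless AI scorer (not the real class)."""

    def score(self, row):
        # The real scorer would run the trained pipeline; this returns a fixed value.
        return [0.5]


def format_prediction(label_names, prediction):
    """Build 'header line + newline + value line + newline' as CSV text."""
    if len(label_names) != len(prediction):
        raise ValueError("expected one label name per predicted value")
    header = ",".join(label_names)
    values = ",".join(str(value) for value in prediction)
    return header + "\n" + values + "\n"


def prediction_attributes(label_names, prediction, num_lists_scored):
    """Collect flow file attributes as strings (attribute keys are assumed)."""
    attributes = {"num_lists_scored": str(num_lists_scored)}
    for name, value in zip(label_names, prediction):
        attributes[name] = str(value)
    return attributes


rows = [[155.6, 104.9, 0.862]]   # made-up sensor values; one row per flow file
first_row = rows[0]
label_names = ["cool_cond_y"]    # hypothetical predicted label name
prediction = StubScorer().score(first_row)

content = format_prediction(label_names, prediction)
attributes = prediction_attributes(label_names, prediction, len(rows))
```

Checking that the number of label names matches the number of predicted values
catches a misconfigured header property before anything is written to the flow
file.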
*H2oPspScoreBatches* - Executes H2O Driverless AI's Python Scoring Pipeline to
do batch scoring on a frame of data within each incoming flow file. Uses H2O's
open-source Datatable library to load the test data into a frame. Each frame
from a flow file passing through this processor should have multiple rows. The
frame is passed to the Driverless AI Python scorer.score_batch() function,
which returns the predictions for one or more labels as a pandas dataframe.
That dataframe is converted to a string so it can be written to the flow file
content. Flow file attributes are added for the number of rows scored, and for
each predicted label name and its associated score in the first row of the
frame. Finally, the flow file is transferred to the success relationship.
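The batch flow and its attributes can be sketched as follows. This is an
illustration under stated assumptions: StubBatchScorer is a hypothetical
stand-in whose score_batch() takes and returns plain lists of rows (the real
function is described above as returning a pandas dataframe), the attribute key
num_rows_scored and the label name are invented for the example, and the
comma-separated output layout is assumed, since the description only says the
dataframe is converted to a string.

```python
class StubBatchScorer:
    """Hypothetical stand-in for the Driverless AI scorer (not the real class)."""

    def score_batch(self, rows):
        # One fixed prediction per input row; a real scorer runs the pipeline.
        return [[0.5] for _ in rows]


def score_rows(scorer, label_names, rows):
    """Score all rows at once; return (content text, flow file attributes)."""
    predictions = scorer.score_batch(rows)

    # Header line first, then one line of predicted values per scored row.
    lines = [",".join(label_names)]
    for predicted in predictions:
        lines.append(",".join(str(value) for value in predicted))
    content = "\n".join(lines) + "\n"

    # Row count, plus label name -> score for the first row only.
    attributes = {"num_rows_scored": str(len(predictions))}
    if predictions:
        for name, value in zip(label_names, predictions[0]):
            attributes[name] = str(value)
    return content, attributes


batch = [[155.6, 104.9], [155.5, 104.4]]   # made-up sensor rows
content, attributes = score_rows(StubBatchScorer(), ["cool_cond_y"], batch)
```

Only the first row's score is copied into the attributes, which keeps the
attribute set small no matter how many rows the frame holds.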
*ConvertDsToCsv* - Converts the data source in the incoming flow file to CSV.
Uses H2O's open-source Datatable library to load the data into a frame, then
converts the frame to a pandas dataframe. Pandas writes the dataframe as CSV,
without the dataframe index, into an in-memory StringIO text stream. The CSV
string is then read back with the read() function on the StringIO object so it
can be written to the flow file content. The flow file is transferred to the
success relationship.
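The StringIO step can be sketched with the standard library alone. Note the
substitution: this example uses Python's csv module in place of Datatable and
pandas so that it runs without extra dependencies, and the column names and
values are made up. It shows one detail that is easy to miss: after writing,
the stream position is at the end, so it has to be rewound before read()
returns anything.

```python
import csv
import io


def rows_to_csv_text(header, rows):
    """Write rows as CSV into an in-memory text stream and read the text back."""
    stream = io.StringIO()
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)   # no index column is written, only the data
    stream.seek(0)           # rewind; read() starts at the current position
    return stream.read()


csv_text = rows_to_csv_text(["psa_bar", "tsa_celsius"], [[155.6, 35.6], [155.5, 35.5]])
```

Calling getvalue() on the StringIO object would return the same text without
the rewind.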
*Hydraulic System Condition Monitoring* Data used in MiNiFi Flow:
The sensor test data I used in this integration comes from [Kaggle: Condition
Monitoring of Hydraulic
Systems|https://www.kaggle.com/jjacostupa/condition-monitoring-of-hydraulic-systems#description.txt].
I was able to predict hydraulic system cooling efficiency through the MiNiFi
and H2O integration described above. The use case is hydraulic system
predictive maintenance.