[
https://issues.apache.org/jira/browse/MINIFICPP-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marton Szasz updated MINIFICPP-1199:
------------------------------------
Fix Version/s: (was: 0.8.0)
0.9.0
> Integrates MiNiFi C++ with H2O Driverless AI Python Scoring Pipeline To Do ML
> Inference on Edge
> -----------------------------------------------------------------------------------------------
>
> Key: MINIFICPP-1199
> URL: https://issues.apache.org/jira/browse/MINIFICPP-1199
> Project: Apache NiFi MiNiFi C++
> Issue Type: New Feature
> Affects Versions: master
> Environment: Ubuntu 18.04 in AWS EC2
> MiNiFi C++ 0.7.0
> Reporter: James Medel
> Priority: Blocker
> Fix For: 0.9.0
>
>
> *MiNiFi C++ and H2O Driverless AI Integration* via Custom Python Processors:
> Integrates MiNiFi C++ with H2O's Driverless AI by using Driverless AI's
> Python Scoring Pipeline and MiNiFi's custom Python processors. The Python
> processors execute the Python Scoring Pipeline scorer to do batch scoring
> and real-time scoring, predicting one or more labels for the test data in
> the incoming flow file content. I would like to contribute my processors
> to MiNiFi C++ as a new feature.
>
> *3 custom Python processors* created for MiNiFi:
> *H2oPspScoreRealTime* - Executes H2O Driverless AI's Python Scoring Pipeline
> to do interactive (real-time) scoring on an individual row (list) of test
> data within each incoming flow file. Uses H2O's open-source Datatable
> library to load the test data into a frame, then converts it to a pandas
> dataframe. Pandas is used to convert the dataframe rows to a list of lists,
> but since each flow file passing through this processor should have only 1
> row, we extract the 1st list. That list is then passed into Driverless AI's
> Python scorer.score() function to predict one or more labels. The
> prediction is returned as a list. The number of predicted labels is fixed
> when the user builds the Python Scoring Pipeline in Driverless AI. With that
> knowledge, there is a property for the user to pass in one or more predicted
> label names that will be used as the predicted header. I create a
> comma-separated string from the predicted header and predicted value. The
> predicted header(s) are on one line followed by a newline, and the predicted
> value(s) are on the next line followed by a newline. The string is written
> to the flow file content. Flow file attributes are added for the number of
> lists scored and for each predicted label name and its associated score.
> Finally, the flow file is transferred on a success relationship.
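The output formatting described for H2oPspScoreRealTime can be sketched roughly as below. This is only an illustration under stated assumptions, not the processor's actual code: `StubScorer` is a hypothetical stand-in for the generated Driverless AI scorer, the standard `csv` module replaces the Datatable and pandas loading steps, the input is assumed to carry a header line, and the attribute name `num_lists_scored` is invented for the example.

```python
import csv
import io


class StubScorer:
    """Hypothetical stand-in for the generated Driverless AI scorer."""

    def score(self, row):
        # A real scorer would run the model; this stub returns fixed values.
        return [0.25, 0.75]


def score_real_time(flow_file_text, scorer, predicted_labels):
    """Score the first data row and build the new flow file content."""
    rows = list(csv.reader(io.StringIO(flow_file_text)))
    first_row = rows[1]  # assumption: rows[0] is a header line
    predictions = scorer.score(first_row)
    if len(predictions) != len(predicted_labels):
        raise ValueError("one label name is needed per predicted value")

    # Header(s) on one line, value(s) on the next, each ending in a newline.
    header_line = ",".join(predicted_labels)
    value_line = ",".join(str(p) for p in predictions)
    content = header_line + "\n" + value_line + "\n"

    # Attribute names here are hypothetical, chosen only for illustration.
    attributes = {"num_lists_scored": "1"}
    for label, value in zip(predicted_labels, predictions):
        attributes[label] = str(value)
    return content, attributes
```

In a real processor the returned content would be written to the flow file and the attributes attached before the transfer to success.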
>
> *H2oPspScoreBatches* - Executes H2O Driverless AI's Python Scoring Pipeline
> to do batch scoring on a frame of data within each incoming flow file. Uses
> H2O's open-source Datatable library to load the test data into a frame. Each
> frame from a flow file passing through this processor should have multiple
> rows. That frame is passed into Driverless AI's Python
> scorer.score_batch() function to predict one or more labels. The
> prediction is returned as a pandas dataframe, which is then converted to a
> string so it can be written to the flow file content. A flow file attribute
> is added for the number of rows scored. Flow file attributes are also added
> for the predicted label name and its associated score for the first row in
> the frame. Finally, the flow file is transferred on a success relationship.
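The attribute bookkeeping described for H2oPspScoreBatches might look something like the sketch below. It rests on several assumptions: `StubBatchScorer` is a hypothetical stand-in for the generated scorer, plain lists replace the Datatable frame and the pandas dataframe, the input is assumed to carry a header line, and the attribute name `num_rows_scored` is made up for the example.

```python
import csv
import io


class StubBatchScorer:
    """Hypothetical stand-in for the generated Driverless AI scorer."""

    def score_batch(self, rows):
        # A real scorer would run the model; this stub returns the row index.
        return [[float(i)] for i, _ in enumerate(rows)]


def score_batches(flow_file_text, scorer, predicted_labels):
    """Score every data row and build the new flow file content."""
    rows = list(csv.reader(io.StringIO(flow_file_text)))
    data_rows = rows[1:]  # assumption: rows[0] is a header line
    predictions = scorer.score_batch(data_rows)

    # Serialize the predictions as CSV text for the flow file content.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(predicted_labels)
    writer.writerows(predictions)

    # Row count plus label/score pairs for the first row only.
    attributes = {"num_rows_scored": str(len(predictions))}
    for label, value in zip(predicted_labels, predictions[0]):
        attributes[label] = str(value)
    return out.getvalue(), attributes
```

Only the first row's scores go into attributes, matching the description above; the full set of predictions travels in the flow file content.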
>
> *ConvertDsToCsv* - Converts the data source in the incoming flow file to
> CSV. Uses H2O's open-source Datatable library to load the data into a frame,
> then converts it to a pandas dataframe. Pandas is used to write the
> dataframe as CSV, without the dataframe index, into an in-memory StringIO
> text stream. The CSV string is then read back with the read() function on
> the StringIO object, so it can be written to the flow file content. The flow
> file is transferred on a success relationship.
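A minimal sketch of the StringIO round trip described for ConvertDsToCsv follows. To stay dependency-free it swaps in the standard `csv` module for Datatable and pandas, and it assumes tab-separated input purely for illustration. With pandas, the equivalent write step would be `DataFrame.to_csv(stream, index=False)`. One detail worth noting in either case: the stream has to be rewound before `read()`, since reading starts at the current position.

```python
import csv
import io


def convert_to_csv(text, delimiter="\t"):
    """Re-serialize delimiter-separated text as CSV via an in-memory stream."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)

    stream = io.StringIO()
    csv.writer(stream, lineterminator="\n").writerows(reader)

    # After writing, the position is at the end of the stream; rewind so
    # read() returns the whole CSV string rather than an empty string.
    stream.seek(0)
    return stream.read()
```

Calling `stream.getvalue()` would return the same string without the need to rewind.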
>
> *Hydraulic System Condition Monitoring* Data used in MiNiFi Flow:
> The sensor test data I used in this integration comes from [Kaggle: Condition
> Monitoring of Hydraulic
> Systems|https://www.kaggle.com/jjacostupa/condition-monitoring-of-hydraulic-systems#description.txt].
> I was able to predict hydraulic system cooling efficiency through the MiNiFi
> and H2O integration described above. The use case is hydraulic system
> predictive maintenance.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)