The Profiler currently persists data within an HBase table. There are some extension points that allow anyone to plug in a different row key or column structure [1]. The default implementation organizes the data in what should be a scalable form for most use cases.
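
To make "scalable form" a bit more concrete, here is a rough sketch of the kind of row key such a design tends to produce. This is an illustration only, not the actual Metron layout: the function names, the salt scheme, the field order, and `SALT_DIVISOR` are all hypothetical. The point is that a key built from a salt prefix plus binary-encoded fields cannot be decoded by a generic connector unless it knows the scheme.

```python
import hashlib
import struct

# Hypothetical number of salt buckets; not a real Metron setting.
SALT_DIVISOR = 1000


def period_for(timestamp_ms: int, period_duration_ms: int) -> int:
    """Map an epoch timestamp to the index of the period containing it."""
    return timestamp_ms // period_duration_ms


def build_row_key(profile: str, entity: str, period: int) -> bytes:
    """Build a row key with an assumed layout: salt | profile | entity | period."""
    # The period is stored as a big-endian signed 64-bit integer.
    period_bytes = struct.pack(">q", period)
    # Derive the salt from the period so that consecutive periods
    # are spread across regions instead of hitting a single one.
    digest = hashlib.md5(period_bytes).digest()
    salt = struct.pack(">H", int.from_bytes(digest[:4], "big") % SALT_DIVISOR)
    return salt + profile.encode("utf-8") + entity.encode("utf-8") + period_bytes


if __name__ == "__main__":
    period = period_for(1_500_000_000_000, 15 * 60 * 1000)
    key = build_row_key("profile1", "10.0.0.1", period)
    print(key.hex())
```

A reader that does not know the salt width, the field order, or the integer encoding sees only opaque bytes, and there are no delimiters between the variable-length fields.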

We currently have functionality to retrieve Profile data via a Java API [2] and a Stellar API [3]. The primary use case for both of these is model scoring. The Stellar API pairs nicely with the Metron MaaS functionality or any model scoring that would be done on streaming data within Metron.

Q. How do I access this data for model training?

The default implementation, while scalable, means that it is very difficult to pull the data from HBase using a generic HBase connector for a third-party platform. How can we make this data most easily accessible for model training in Spark, R, etc?

*A1:* Create custom connectors for a variety of third-party platforms like Spark, R, etc.

*A2:* Provide an alternate persistence layer for the Profiler. This data makes the most sense in a TSDB (Time Series Database). It is much more likely that third-party platforms will have connectors already available for a TSDB like OpenTSDB or InfluxDB.

*A3: Something else that is way better*???

--
[1] See org.apache.metron.profiler.hbase.RowKeyBuilder, ColumnBuilder
[2] See org.apache.metron.profiler.client.hbase.HBaseProfilerClient
[3] See org.apache.metron.profiler.client.stellar.ProfileGet

--
Nick Allen <[email protected]>
