GitHub user cestella commented on the issue:

    https://github.com/apache/incubator-metron/pull/449

@nickwallen Yep, I see what you mean. I think we had different interpretations of "user focused." Where I landed on what I'd like can be broken down into a near-term, a medium-term, and a long-term vision for the profiler.

**Near Term**

In the near term, for this PR, we need the ability to:
* Write from the profiler to Kafka so we can triage the output of the profiler
* Adjust the representations of the data the profiler writes based on the places it writes to
    * For HBase, any kryo serialized object
    * For Kafka, any fundamental structure (e.g. number, string) or Map of fundamental structures
* Be backwards compatible with the current syntax

Strictly speaking, I'll adopt your approach here of separating representation by its destination, where that destination is restricted to the possible destinations inside of Metron's current architecture (so-called *destination-focused*), rather than separating representation by its storage mechanism, where those storage mechanisms are restricted to the possible mechanisms that we support in Metron (so-called *writer-focused*).

In the following example, every tick, the following happens:
* 1 message is written to HBase with the stats function
* 1 message is written to Kafka that looks like this:
```
{
  'profile' : 'test',
  'entity' : 'global',
  'mean' : ####,
  'stddev' : ####
}
```

The profile definition looks like:
```
{
  "profiles": [
    {
      "profile": "test",
      "foreach": "'global'",
      "onlyif": "source.type == 'squid'",
      "update": { "stats": "STATS_ADD(stats, LENGTH(url))" },
      "result": {
        "profile" : "stats",
        "triage" : "{ 'mean' : STATS_MEAN(stats), 'stddev' : STATS_SD(stats) }"
      }
    }
  ]
}
```

**Medium Term**

This gets expanded to allow for multiple elements written per profile.
In the following example, every tick, the following happens:
* 2 messages are written to HBase for profile `test`
    * entity: `global:stats`
    * entity: `global:count`
* 2 messages are written to Kafka, which look like this:
```
{
  'profile' : 'test',
  'entity' : 'global',
  'result_type' : 'baseline_stats',
  'mean' : ####,
  'stddev' : ####
}
```
and
```
{
  'profile' : 'test',
  'entity' : 'global',
  'result_type' : 'kurtosis',
  'kurtosis' : ####
}
```

The profile definition looks like:
```
{
  "profiles": [
    {
      "profile": "test",
      "foreach": "'global'",
      "onlyif": "source.type == 'squid'",
      "update": { "stats": "STATS_ADD(stats, LENGTH(url))" },
      "result": {
        "profile" : {
          "stats" : "stats",
          "count" : "STATS_COUNT(stats)"
        },
        "triage" : {
          "baseline_stats" : "{ 'mean' : STATS_MEAN(stats), 'stddev' : STATS_SD(stats) }",
          "kurtosis" : "STATS_KURTOSIS(stats)"
        }
      }
    }
  ]
}
```

**Longer Term**

This is where, in my mind, the writer-focused approach morphs into a 'writer configuration' focused one, which is to say, not just the transport, but also the destination. In this world, we can directly associate the representation of the things we're writing from the profiler with the destination.

Our point of configuration for new writers in Metron is the `MessageWriter` and `BulkMessageWriter` interfaces. We recently pulled out the configs into their own indexing configs, keyed by writer (kafka, elasticsearch, etc). Imagine that the writers are configured entirely there and that the config is not writer-oriented, but use-case oriented. Instead of what we have now in the indexing config, we can make it:
```
{
  "writers" : {
    "kafka" : {
      "batchSize" : 1,
      "enabled" : true
    },
    "hbase_profile" : {
      "batchSize" : 5,
      "enabled" : true
    }
  },
  "endpoints" : {
    "triage" : {
      "writer" : "kafka",
      "queue" : "enrichments"
    },
    "profile" : {
      "writer" : "hbase_profile",
      "table" : "profile:P"
    }
  }
}
```

Here, the two forms merge into one, because we can represent the capability-driven design that you are focused on, @nickwallen, using our core abstractions.
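
As a rough sketch of how that merged config could be read (Python purely for illustration; `resolve_endpoint` is a hypothetical helper, not existing Metron code, and the merge rule is my assumption), an endpoint name resolves to its writer plus the settings of both:

```python
import json

# The indexing config from the example above, unchanged.
INDEXING_CONFIG = json.loads("""
{
  "writers" : {
    "kafka" : { "batchSize" : 1, "enabled" : true },
    "hbase_profile" : { "batchSize" : 5, "enabled" : true }
  },
  "endpoints" : {
    "triage" : { "writer" : "kafka", "queue" : "enrichments" },
    "profile" : { "writer" : "hbase_profile", "table" : "profile:P" }
  }
}
""")


def resolve_endpoint(config, endpoint_name):
    """Hypothetical helper: combine an endpoint's settings with its writer's settings."""
    endpoint = dict(config["endpoints"][endpoint_name])  # copy so the config is not mutated
    writer_name = endpoint.pop("writer")
    writer = config["writers"][writer_name]
    if not writer.get("enabled", False):
        raise ValueError("writer '%s' is disabled" % writer_name)
    # Assumption: endpoint-level settings are layered on top of writer-level settings.
    return {"writer": writer_name, **writer, **endpoint}


if __name__ == "__main__":
    print(resolve_endpoint(INDEXING_CONFIG, "triage"))
    print(resolve_endpoint(INDEXING_CONFIG, "profile"))
```

The point is only that the use case (`triage`, `profile`) becomes the lookup key, and the writer becomes a detail behind it.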

In this world, the profiler is simple: it just writes messages out to the indexing topology. The structure of the tuple looks like:
* message
* endpoint

The indexing topology will use the source type to pull the config and, since the endpoint is specified in the tuple, it will use the endpoint to write the message to the appropriately configured destination.

In this world, the example from the medium term does not change:
```
{
  "profiles": [
    {
      "profile": "test",
      "foreach": "'global'",
      "onlyif": "source.type == 'squid'",
      "update": { "stats": "STATS_ADD(stats, LENGTH(url))" },
      "result": {
        "profile" : {
          "stats" : "stats",
          "count" : "STATS_COUNT(stats)"
        },
        "triage" : {
          "baseline_stats" : "{ 'mean' : STATS_MEAN(stats), 'stddev' : STATS_SD(stats) }",
          "kurtosis" : "STATS_KURTOSIS(stats)"
        }
      }
    }
  ]
}
```

`profile` and `triage` are interpreted to mean endpoint names, which can be looked up in the indexing configuration.
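
To make the (message, endpoint) tuple concrete, here is a minimal sketch in Python (illustration only; `to_tuples` is a hypothetical name, and the Stellar expressions are assumed to have been evaluated already). It fans a profile's `result` block out into one tuple per named result. For simplicity it assumes every endpoint gets the same message shape as the Kafka examples, which is not necessarily how the HBase side would be represented:

```python
def to_tuples(profile, entity, evaluated_result):
    """Hypothetical helper: turn evaluated results into (message, endpoint) tuples.

    evaluated_result maps endpoint name -> {result name -> value}, where each
    value is what the corresponding Stellar expression evaluated to this tick.
    """
    tuples = []
    for endpoint, results in evaluated_result.items():
        for result_type, value in results.items():
            message = {
                "profile": profile,
                "entity": entity,
                "result_type": result_type,
            }
            if isinstance(value, dict):
                # A map result (e.g. mean and stddev) is flattened into the message.
                message.update(value)
            else:
                # A scalar result is stored under its own name.
                message[result_type] = value
            tuples.append((message, endpoint))
    return tuples


if __name__ == "__main__":
    # The numbers below are made-up placeholders, not real profiler output.
    evaluated = {
        "profile": {"count": 3},
        "triage": {
            "baseline_stats": {"mean": 1.0, "stddev": 0.5},
            "kurtosis": 2.0,
        },
    }
    for message, endpoint in to_tuples("test", "global", evaluated):
        print(endpoint, message)
```

The indexing topology would then only need the endpoint half of each tuple to pick the configured destination.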