Github user cestella commented on the issue:

    https://github.com/apache/incubator-metron/pull/449
  
    @nickwallen Yep, I see what you mean.  I think we had different 
interpretations of "user focused."
    
    I think where I landed can be broken down into a near-term, a medium-term, and a long-term vision for the profiler.
    
    **Near Term**
    In the near term, for this PR, we need the ability to:
    * Write from the profiler to kafka so we can triage the output of the 
profiler
    * Adjust the representation of the data that the profiler writes based on the destination it writes to:
      * For HBase, any Kryo-serialized object
      * For Kafka, any fundamental type (e.g. a number or string) or a Map of fundamental types (see the sketch after this list)
    * Be backwards compatible with the current syntax.
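    
    To make those two representations concrete, here's a minimal Java sketch of the two encodings.  The helper names and the Kryo/json-simple usage are illustrative assumptions on my part, not code from this PR:
    ```
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    
    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Output;
    import org.json.simple.JSONObject;
    
    public class ResultSerialization {
    
      // HBase destination: any object, Kryo-serialized to bytes.
      public static byte[] toHBaseValue(Object result) {
        Kryo kryo = new Kryo();
        try (Output output = new Output(1024, -1)) {
          kryo.writeClassAndObject(output, result);
          return output.toBytes();
        }
      }
    
      // Kafka destination: a Map of fundamental types, rendered as a JSON message.
      public static byte[] toKafkaValue(String profile, String entity, Map<String, Object> result) {
        JSONObject message = new JSONObject();
        message.put("profile", profile);  // e.g. "test"
        message.put("entity", entity);    // e.g. "global"
        message.putAll(result);           // e.g. { 'mean' : ..., 'stddev' : ... }
        return message.toJSONString().getBytes(StandardCharsets.UTF_8);
      }
    }
    ```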
    
    Strictly speaking, I'll adopt your approach here of separating representation by destination, where the destinations are restricted to those possible inside of Metron's current architecture (so called *destination-focused*), rather than separating representation by storage mechanism, where the mechanisms are restricted to those we support in Metron (so called *writer-focused*).
    
    In the following example, every tick:
    * 1 message is written to HBase containing the `stats` object
    * 1 message is written to Kafka that looks like this:
    ```
    {
      'profile' : 'test',
      'entity' : 'global',
      'mean' : ####,
      'stddev' : ####
    }
    ```
    
    This looks like: 
    ```
    {
      "profiles": [
        {
          "profile": "test",
          "foreach": "'global'",
          "onlyif": "source.type == 'squid'",
          "update":  { "stats": "STATS_ADD(stats, LENGTH(url))" },
          "result":  {
             "profile" : "stats",
             "triage" : "{ 'mean' : STATS_MEAN(stats), 'stddev' : 
STATS_SD(stats) }"
           }     
        }
      ]
    }
    ```
    
    **Medium Term**
    
    This gets expanded to allow for multiple elements written per profile.
    
    In the following example, every tick:
    * 2 messages are written to HBase for profile `test`
      * entity: `global:stats`
      * entity: `global:count`
    * 2 messages are written to Kafka that look like this:
    ```
    {
      'profile' : 'test',
      'entity' : 'global',
      'result_type' : 'baseline_stats',
      'mean' : ####,
      'stddev' : ####
    }
    ```
    and
    ```
    {
      'profile' : 'test',
      'entity' : 'global',
      'result_type' : 'kurtosis',
      'kurtosis' : ####
    }
    ```
    
    This looks like:
    ```
    {
      "profiles": [
        {
          "profile": "test",
          "foreach": "'global'",
          "onlyif": "source.type == 'squid'",
          "update":  { "stats": "STATS_ADD(stats, LENGTH(url))" },
          "result":  {
             "profile" : {
                          "stats" :  "stats",
                          "count" : "STATS_COUNT(stats)"
             "triage" : {
                          "baseline_stats" : "{ 'mean' : STATS_MEAN(stats), 
'stddev' : STATS_SD(stats) }",
                          "kurtosis" : "STATS_KURTOSIS(stats)"
                           }
                       }   
        }
      ]
    }
    ```
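    
    A rough Java sketch of the fan-out I'm picturing (the class and method names are hypothetical, not the profiler's actual code):
    ```
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    
    import org.json.simple.JSONObject;
    
    public class MediumTermFanOut {
    
      // For each key under "profile", the HBase entity becomes entity + ":" + key,
      // e.g. "global:stats" and "global:count".
      public static List<String> hbaseEntities(String entity, Map<String, Object> profileResults) {
        List<String> entities = new ArrayList<>();
        for (String key : profileResults.keySet()) {
          entities.add(entity + ":" + key);
        }
        return entities;
      }
    
      // For each key under "triage", emit one Kafka message tagged with result_type.
      public static List<JSONObject> kafkaMessages(String profile, String entity,
                                                   Map<String, Object> triageResults) {
        List<JSONObject> messages = new ArrayList<>();
        for (Map.Entry<String, Object> kv : triageResults.entrySet()) {
          JSONObject message = new JSONObject();
          message.put("profile", profile);          // e.g. "test"
          message.put("entity", entity);            // e.g. "global"
          message.put("result_type", kv.getKey());  // e.g. "baseline_stats" or "kurtosis"
          if (kv.getValue() instanceof Map) {
            // A map result is flattened into the message, e.g. mean/stddev.
            message.putAll((Map<?, ?>) kv.getValue());
          } else {
            // A scalar result is keyed by its result_type, e.g. 'kurtosis' : ####.
            message.put(kv.getKey(), kv.getValue());
          }
          messages.add(message);
        }
        return messages;
      }
    }
    ```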
    
    
    **Longer Term**
    
    This is where, in my mind, writer-focused morphs into writer-*configuration*-focused: not just the transport, but also the destination.  In this world, we can directly associate the representation of what the profiler writes with its destination.  Our point of configuration for new writers in Metron is the `MessageWriter` and `BulkMessageWriter` interfaces.  We recently pulled the configs out into their own indexing configs, keyed by writer (kafka, elasticsearch, etc.).  Imagine that the writers are configured entirely there and that the config is not writer-oriented, but use-case-oriented.  Instead of what we have now in the indexing config, we could have:
    ```
    {
      "writers" : {
         "kafka" : {
            "batchSize" : 1,
            "enabled" : true
         },
         "hbase_profile" : {
            "batchSize" : 5,
            "enabled" : true
         }
      },
      "endpoints" : {
         "triage" : {
            "writer" : "kafka",
            "queue" : "enrichments"
         },
         "profile" : {
            "writer" : "hbase_profile",
            "table" : "profile:P"
         }
      }
    }
    ```
    
    Here, the two forms merge into one, because our core abstractions can represent the capability-driven design that you are focused on, @nickwallen.  In this world, the profiler is simple: it just writes messages out to the indexing topology.  The structure of the tuple looks like:
    * message
    * endpoint
    
    The indexing topology will use the source type to pull the config and, 
since the endpoint is specified in the tuple, it will use the endpoint to write 
the message to the appropriately configured destination.
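    
    A minimal Java sketch of that dispatch, using hypothetical stand-ins (`ProfilerTuple`, `MessageWriterFacade`) rather than Metron's actual interfaces:
    ```
    import java.util.Map;
    
    public class EndpointDispatch {
    
      // The tuple the profiler emits: just the message and an endpoint name.
      public static class ProfilerTuple {
        public final Map<String, Object> message;
        public final String endpoint;  // e.g. "triage" or "profile"
        public ProfilerTuple(Map<String, Object> message, String endpoint) {
          this.message = message;
          this.endpoint = endpoint;
        }
      }
    
      // Stand-in for a writer behind the MessageWriter/BulkMessageWriter interfaces.
      public interface MessageWriterFacade {
        void write(Map<String, Object> message, Map<String, Object> endpointConfig);
      }
    
      // Resolve the endpoint to its configured writer and hand the message off.
      // The config maps mirror the "endpoints" and "writers" sections above.
      public static void dispatch(ProfilerTuple tuple,
                                  Map<String, Map<String, Object>> endpoints,
                                  Map<String, MessageWriterFacade> writers) {
        Map<String, Object> endpointConfig = endpoints.get(tuple.endpoint);
        String writerName = (String) endpointConfig.get("writer");  // e.g. "kafka"
        writers.get(writerName).write(tuple.message, endpointConfig);
      }
    }
    ```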
    
    In this world, the example in the medium term does not change:
    ```
    {
      "profiles": [
        {
          "profile": "test",
          "foreach": "'global'",
          "onlyif": "source.type == 'squid'",
          "update":  { "stats": "STATS_ADD(stats, LENGTH(url))" },
          "result":  {
             "profile" : {
                          "stats" :  "stats",
                          "count" : "STATS_COUNT(stats)"
             "triage" : {
                          "baseline_stats" : "{ 'mean' : STATS_MEAN(stats), 
'stddev' : STATS_SD(stats) }",
                          "kurtosis" : "STATS_KURTOSIS(stats)"
                           }
                       }   
        }
      ]
    }
    ```
    `profile` and `triage` are interpreted as endpoint names, which can be looked up in the indexing configuration.

