Casey Stella created METRON-1378:
------------------------------------

             Summary: Create a summarizer
                 Key: METRON-1378
                 URL: https://issues.apache.org/jira/browse/METRON-1378
             Project: Metron
          Issue Type: Improvement
            Reporter: Casey Stella


We have a nice, generalized infrastructure for loading data into HBase via 
`flatfile_loader.sh` and interacting with it via `ENRICHMENT_GET()`.  It is 
also useful to summarize a set of data into a static data structure, store it 
on HDFS, and interact with it via Stellar.  To this end, to complement 
`flatfile_loader.sh`, we should have a `flatfile_summarizer.sh` that, using the 
same extractor config, will process a flat file and output a serialized object.

The use case for this is as follows:
Let's say that I have a static list of domains in the second column of a CSV, 
`domains.csv`, and I want to generate a bloom filter containing those domains, 
sans TLD.

I should be able to create a file called `bloom.ser` with the serialized bloom 
filter given the extractor config:
{code}
{
  "config" : {
    "columns" : {
       "rank" : 0,
       "domain" : 1
    },
    "value_transform" : {
       "domain" : "DOMAIN_REMOVE_TLD(domain)"
    },
    "value_filter" : "LENGTH(domain) > 0",
    "state_init" : "BLOOM_INIT()",
    "state_update" : {
       "state" : "BLOOM_ADD(state, domain)"
    },
    "state_merge" : "BLOOM_MERGE(states)",
    "separator" : ","
  },
  "extractor" : "CSV"
}
{code}
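
To make the intended semantics of `state_init`, `state_update` and 
`state_merge` concrete, here is a minimal Python sketch of how I read the 
config above.  It is not Metron code and not a proposed implementation: the 
`Bloom` class, `remove_tld`, `summarize` and `merge_states` are hypothetical 
stand-ins for `BLOOM_INIT()`/`BLOOM_ADD()`/`BLOOM_MERGE()`, 
`DOMAIN_REMOVE_TLD()` and the summarizer itself, and `remove_tld` only strips 
the last label rather than consulting a real TLD list.

```python
import csv
import hashlib
from functools import reduce


class Bloom:
    """Tiny Bloom filter; a hypothetical stand-in for the BLOOM_* state."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = 0  # bit set stored in a single Python int

    def _positions(self, item):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos
        return self

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        # A bitwise OR is a valid union only when both filters share the
        # same size and hash count, which is assumed here.
        merged = Bloom(self.size, self.hashes)
        merged.bits = self.bits | other.bits
        return merged


def remove_tld(domain):
    # Simplified assumption: drop only the last dot-separated label.
    head, _, _ = domain.strip().rpartition(".")
    return head


def summarize(lines):
    """Fold one chunk of CSV lines into a single state object."""
    state = Bloom()                        # state_init
    for row in csv.reader(lines):          # extractor CSV, separator ","
        if len(row) < 2:
            continue                       # skip rows without a domain column
        domain = remove_tld(row[1])        # value_transform
        if len(domain) > 0:                # value_filter
            state = state.add(domain)      # state_update
    return state


def merge_states(states):
    """Combine per-chunk states into one."""
    return reduce(lambda a, b: a.merge(b), states, Bloom())  # state_merge


# Two chunks summarized independently, then merged into one filter.
part1 = summarize(["1,google.com", "2,example.org"])
part2 = summarize(["3,apache.org"])
bloom = merge_states([part1, part2])
```

The point of having a separate merge step is that each chunk of the input can 
be summarized on its own and the partial states combined afterwards, so the 
merge must give the same answer regardless of how the file was split.  In 
this sketch, a bare name such as `localhost` becomes an empty string after 
the transform and is dropped by the filter.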



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
