Casey Stella created METRON-1378:
------------------------------------
Summary: Create a summarizer
Key: METRON-1378
URL: https://issues.apache.org/jira/browse/METRON-1378
Project: Metron
Issue Type: Improvement
Reporter: Casey Stella
We have a nice, generalized infrastructure for loading data into HBase and
interacting with it via `flatfile_loader.sh` and `ENRICHMENT_GET()`. It is
also useful to summarize a set of data into a static data structure, store it
on HDFS, and interact with it via Stellar. To this end, to complement
`flatfile_loader.sh`, we should have a `flatfile_summarizer.sh` that, using the
same extractor config, processes a flat file and outputs a serialized object.
The use case for this is as follows:
Let's say that I have a static list of domains in the second column of a CSV,
`domains.csv`, and I want to generate a bloom filter containing those domains
sans TLD.
I should be able to create a file called `bloom.ser` with the serialized bloom
filter given the extractor config:
{code}
{
  "config" : {
    "columns" : {
      "rank" : 0,
      "domain" : 1
    },
    "value_transform" : {
      "domain" : "DOMAIN_REMOVE_TLD(domain)"
    },
    "value_filter" : "LENGTH(domain) > 0",
    "state_init" : "BLOOM_INIT()",
    "state_update" : {
      "state" : "BLOOM_ADD(state, domain)"
    },
    "state_merge" : "BLOOM_MERGE(states)",
    "separator" : ","
  },
  "extractor" : "CSV"
}
{code}
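To make the intended semantics of `state_init`, `state_update` and `state_merge` concrete, here is a rough Python sketch of the fold that the config above appears to describe. It is only an illustration, not the proposed implementation: the helper names are hypothetical stand-ins for the Stellar functions, a plain `set` stands in for the bloom filter, TLD removal is simplified to dropping the last dot-separated label, and I am assuming that `value_filter` is evaluated after `value_transform` and that the input may be processed in several partitions whose states are merged at the end.

```python
import csv
from functools import reduce

# Hypothetical stand-ins for the Stellar functions named in the config.

def domain_remove_tld(domain):
    # Simplification: drop only the last dot-separated label.
    # A name with no dot becomes the empty string.
    head, _, _ = domain.rpartition(".")
    return head

def bloom_init():
    # A set stands in for a real bloom filter in this sketch.
    return set()

def bloom_add(state, value):
    state.add(value)
    return state

def bloom_merge(states):
    # Union all partial states into one summary.
    return reduce(lambda a, b: a | b, states, set())

def summarize_partition(lines, columns, separator=","):
    """Fold one partition of CSV lines into a single state."""
    state = bloom_init()                                  # state_init
    for row in csv.reader(lines, delimiter=separator):
        record = {name: row[idx] for name, idx in columns.items()}
        record["domain"] = domain_remove_tld(record["domain"])  # value_transform
        if len(record["domain"]) > 0:                     # value_filter
            state = bloom_add(state, record["domain"])    # state_update
    return state

# Two partitions of a made-up domains.csv (rank, domain).
columns = {"rank": 0, "domain": 1}
partitions = [
    ["1,google.com", "2,youtube.com"],
    ["3,localhost", "4,google.com"],
]
states = [summarize_partition(p, columns) for p in partitions]
summary = bloom_merge(states)                             # state_merge
```

In this sketch `summary` ends up holding `google` and `youtube`; `localhost` is dropped by the filter because nothing is left once its last label is removed. The merged object is what would then be serialized to `bloom.ser`.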
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)