Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/879#discussion_r160179259

--- Diff: metron-platform/metron-data-management/README.md ---
@@ -354,3 +357,91 @@ The parameters for the utility are as follows:
 | -r | --remote_dir | No | HDFS directory to land formatted GeoIP file - defaults to /apps/metron/geo/\<epoch millis\>/ |
 | -t | --tmp_dir | No | Directory for landing the temporary GeoIP data - defaults to /tmp |
 | -z | --zk_quorum | Yes | Zookeeper Quorum URL (zk1:port,zk2:port,...) |
+
+### Flatfile Summarizer
+
+The shell script `$METRON_HOME/bin/flatfile_summarizer.sh` will read data from local disk, HDFS or URLs and generate a summary object.
+The object will be serialized and written to disk, either HDFS or local disk, depending on the output mode specified.
+
+Note that this utility uses the same extractor config as `flatfile_loader.sh`,
+but because the output target is not a key-value store (but rather a summary object), certain configs
+need not be specified:
+* `indicator`, `indicator_filter` and `indicator_transform` are not required, but will be executed if present.
+As in the loader, an `indicator` field will be available if you specify it (by using `indicator` in the config).
+* `type` is neither required nor used
+
+Some new configs are expected:
+* `state_init` : Executed once to initialize the state object (the object written out).
+* `state_update` : Executed once per message. The fields available are the fields for the row as well as
+  * `indicator` - the indicator value, if you have specified it in the config
+  * `state` - the current state. Useful for adding to the state (e.g. `BLOOM_ADD(state, val)` where `val` is the name of a field).
+* `state_merge` : If you are running multi-threaded and your state objects can be merged, this is the statement that will
+merge the state objects created per thread. There is a special field available to this config:
+  * `states` - a list of the per-thread state objects
+
+One configuration parameter of the Extractor config is only considered by this
+utility:
+* `inputFormat` : Specifies how to treat the input data. The two implementations are `BY_LINE` and `WHOLE_FILE`.
+
+The default is `BY_LINE`, which makes sense for a list of CSVs where
+each line is a unit of information to be imported.
+However, if you are importing a set of STIX documents, then you want
+each whole document to be passed as input to the Extractor.
+
+#### Example
+
+Suppose you want to generate a bloom filter of all the domains in a CSV structured similarly to
+the Alexa top 1M domains, so the columns are:
+* rank
+* domain name
+
+You want the bloom filter to contain just the domains, without the TLD.
+You would execute the following to:
+* read data from `./top-1m.csv`
+* write the serialized filter to `./filter.ser`
+* use 5 threads
+
+```
+$METRON_HOME/bin/flatfile_summarizer.sh -i ./top-1m.csv -o ./filter.ser -e ./extractor.json -p 5 -b 128
+```
+
+To configure this, `extractor.json` would look like:
+```
+{
+  "config" : {
+    "columns" : {
+      "rank" : 0,
+      "domain" : 1
+    },
+    "value_transform" : {
+      "domain" : "DOMAIN_REMOVE_TLD(domain)"
+    },
+    "value_filter" : "LENGTH(domain) > 0",
+    "state_init" : "BLOOM_INIT()",
+    "state_update" : {
+      "state" : "BLOOM_ADD(state, domain)"
+    },
+    "state_merge" : "BLOOM_MERGE(states)",
+    "separator" : ","
+  },
+  "extractor" : "CSV"
+}
+```
+
+#### Parameters
+
+The parameters for the utility are as follows:
+
+| Short Code | Long Code          | Is Required? | Description                                                                                                                                                                      |
+|------------|--------------------|--------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| -h         |                    | No           | Generate the help screen/set of options                                                                                                                                          |
+| -q         | --quiet            | No           | Do not update progress                                                                                                                                                           |
+| -e         | --extractor_config | Yes          | JSON document describing the extractor for this input data source                                                                                                                |
+| -m         | --import_mode      | No           | The import mode to use: LOCAL, MR. Default: LOCAL                                                                                                                                |
+| -om        | --output_mode      | No           | The output mode to use: LOCAL, HDFS. Default: LOCAL                                                                                                                              |
+| -i         | --input            | Yes          | The input data location on local disk. If this is a file, then that file will be loaded. If this is a directory, then the files will be loaded recursively under that directory. |
+| -i         | --output           | Yes          | The output data location.                                                                                                                                                        |
--- End diff --

    -o here, not -i
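The `state_init` / `state_update` / `state_merge` lifecycle described in the diff can be sketched in Python. This is an illustrative stand-in only, not Metron's implementation: a plain set substitutes for the real bloom filter, and all function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins mirroring the extractor config keys.
# A plain set plays the role of the bloom filter.

def state_init():
    # Analogue of "state_init": build the empty state object once.
    return set()

def state_update(state, value):
    # Analogue of "state_update": fold one row's value into the state.
    state.add(value)
    return state

def state_merge(states):
    # Analogue of "state_merge": combine the per-thread state objects.
    merged = set()
    for s in states:
        merged |= s
    return merged

def summarize(rows, threads=2):
    # Partition rows among workers, each with its own state object,
    # then merge the per-thread states into one summary (mirroring
    # the summarizer's multi-threaded mode).
    chunks = [rows[i::threads] for i in range(threads)]

    def worker(chunk):
        state = state_init()
        for row in chunk:
            state = state_update(state, row)
        return state

    with ThreadPoolExecutor(max_workers=threads) as ex:
        states = list(ex.map(worker, chunks))
    return state_merge(states)
```

With a real bloom filter, `state_merge` is what makes the multi-threaded mode safe: each thread builds an independent filter and the filters are OR'd together at the end.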
---
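For illustration, the `value_transform` / `value_filter` steps from the example `extractor.json` above can be approximated in Python. `domain_remove_tld` here is a rough, hypothetical stand-in for Stellar's `DOMAIN_REMOVE_TLD` (the real function consults a public-suffix list, so e.g. `co.uk` domains behave differently), and `extract` is not a Metron API:

```python
def domain_remove_tld(domain):
    # Rough approximation of DOMAIN_REMOVE_TLD: drop the last
    # dot-separated label ("google.com" -> "google"). The real Stellar
    # function uses a public-suffix list instead.
    parts = domain.rsplit(".", 1)
    return parts[0] if len(parts) == 2 else domain

def extract(row):
    # Mirrors the example config: columns rank (0) and domain (1),
    # then value_transform, then value_filter (LENGTH(domain) > 0).
    rank, domain = row.split(",", 1)
    domain = domain_remove_tld(domain)
    return domain if len(domain) > 0 else None  # None == filtered out
```

Each surviving value is what `state_update` would see as `domain` when adding to the bloom filter.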