Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/879#discussion_r160179259

--- Diff: metron-platform/metron-data-management/README.md ---
@@ -354,3 +357,91 @@ The parameters for the utility are as follows:
 | -r | --remote_dir | No | HDFS directory to land formatted GeoIP file - defaults to /apps/metron/geo/\<epoch millis\>/ |
 | -t | --tmp_dir | No | Directory for landing the temporary GeoIP data - defaults to /tmp |
 | -z | --zk_quorum | Yes | Zookeeper Quorum URL (zk1:port,zk2:port,...) |
+
+### Flatfile Summarizer
+
+The shell script `$METRON_HOME/bin/flatfile_summarizer.sh` will read data from local disk, HDFS or URLs and generate a summary object.
+The object will be serialized and written to disk, either HDFS or local disk, depending on the output mode specified.
+
+Note that this utility uses the same extractor config as `flatfile_loader.sh`,
+but because the output target is not a key-value store (but rather a summary object), certain configs
+need not be specified:
+* `indicator`, `indicator_filter` and `indicator_transform` are not required, but will be executed if present.
+As in the loader, an `indicator` field will be available if you specify it (by using `indicator` in the config).
+* `type` is neither required nor used
+
+Some new configs are expected:
+* `state_init` : Executed once to initialize the state object (the object written out).
+* `state_update` : Executed once per message. The fields available are the fields for the row as well as
+  * `indicator` - the indicator value, if you have specified it in the config
+  * `state` - the current state. Useful for adding to the state (e.g. `BLOOM_ADD(state, val)` where `val` is the name of a field).
+* `state_merge` : If you are running multi-threaded and your state objects can be merged, this is the statement that will
+merge the state objects created per thread. There is a special field available to this config:
+  * `states` - a list of the per-thread state objects
+
+One configuration parameter of the Extractor config is only considered by this
+utility:
+* `inputFormat` : Specifies how to treat the input data. The two implementations are `BY_LINE` and `WHOLE_FILE`.
+
+The default is `BY_LINE`, which makes sense for a list of CSVs where
+each line is a unit of information to be imported.
+However, if you are importing a set of STIX documents, then you want
+each whole document to be passed as input to the Extractor.
+
+#### Example
+
+Suppose you want to generate a bloom filter of all the domains in a CSV structured similarly to
+the Alexa top 1M domains, so the columns are:
+* rank
+* domain name
+
+You want the bloom filter to contain just the domains, without the TLD.
+You would execute the following to:
+* read data from `./top-1m.csv`
+* write the serialized filter to `./filter.ser`
+* use 5 threads
+
+```
+$METRON_HOME/bin/flatfile_summarizer.sh -i ./top-1m.csv -o ./filter.ser -e ./extractor.json -p 5 -b 128
+```
+
+To configure this, `extractor.json` would look like:
+```
+{
+  "config" : {
+    "columns" : {
+      "rank" : 0,
+      "domain" : 1
+    },
+    "value_transform" : {
+      "domain" : "DOMAIN_REMOVE_TLD(domain)"
+    },
+    "value_filter" : "LENGTH(domain) > 0",
+    "state_init" : "BLOOM_INIT()",
+    "state_update" : {
+      "state" : "BLOOM_ADD(state, domain)"
+    },
+    "state_merge" : "BLOOM_MERGE(states)",
+    "separator" : ","
+  },
+  "extractor" : "CSV"
+}
+```
+
+#### Parameters
+
+The parameters for the utility are as follows:
+
+| Short Code | Long Code          | Is Required? | Description                                                                                                                                                                      |
+|------------|--------------------|--------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| -h         |                    | No           | Generate the help screen/set of options                                                                                                                                          |
+| -q         | --quiet            | No           | Do not update progress                                                                                                                                                           |
+| -e         | --extractor_config | Yes          | JSON document describing the extractor for this input data source                                                                                                                |
+| -m         | --import_mode      | No           | The import mode to use: LOCAL, MR. Default: LOCAL                                                                                                                                |
+| -om        | --output_mode      | No           | The output mode to use: LOCAL, HDFS. Default: LOCAL                                                                                                                              |
+| -i         | --input            | Yes          | The input data location on local disk. If this is a file, then that file will be loaded. If this is a directory, then the files will be loaded recursively under that directory. |
+| -i         | --output           | Yes          | The output data location.                                                                                                                                                        |
--- End diff --

    -o here, not -i
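The `state_init` / `state_update` / `state_merge` lifecycle described in the diff can be sketched in Python. This is an illustrative stand-in only, not Metron's implementation: a plain set substitutes for the real bloom filter, and all function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins mirroring the extractor config keys.
# A plain set plays the role of the bloom filter.

def state_init():
    # Analogue of "state_init": build the empty state object once.
    return set()

def state_update(state, value):
    # Analogue of "state_update": fold one row's value into the state.
    state.add(value)
    return state

def state_merge(states):
    # Analogue of "state_merge": combine the per-thread state objects.
    merged = set()
    for s in states:
        merged |= s
    return merged

def summarize(rows, threads=2):
    # Partition rows among workers, each with its own state object,
    # then merge the per-thread states into one summary (mirroring
    # the summarizer's multi-threaded mode).
    chunks = [rows[i::threads] for i in range(threads)]

    def worker(chunk):
        state = state_init()
        for row in chunk:
            state = state_update(state, row)
        return state

    with ThreadPoolExecutor(max_workers=threads) as ex:
        states = list(ex.map(worker, chunks))
    return state_merge(states)
```

With a real bloom filter, `state_merge` is what makes the multi-threaded mode safe: each thread builds an independent filter and the filters are OR'd together at the end.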
---
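For illustration, the `value_transform` / `value_filter` steps from the example `extractor.json` above can be approximated in Python. `domain_remove_tld` here is a rough, hypothetical stand-in for Stellar's `DOMAIN_REMOVE_TLD` (the real function consults a public-suffix list, so e.g. `co.uk` domains behave differently), and `extract` is not a Metron API:

```python
def domain_remove_tld(domain):
    # Rough approximation of DOMAIN_REMOVE_TLD: drop the last
    # dot-separated label ("google.com" -> "google"). The real Stellar
    # function uses a public-suffix list instead.
    parts = domain.rsplit(".", 1)
    return parts[0] if len(parts) == 2 else domain

def extract(row):
    # Mirrors the example config: columns rank (0) and domain (1),
    # then value_transform, then value_filter (LENGTH(domain) > 0).
    rank, domain = row.split(",", 1)
    domain = domain_remove_tld(domain)
    return domain if len(domain) > 0 else None  # None == filtered out
```

Each surviving value is what `state_update` would see as `domain` when adding to the bloom filter.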