GitHub user cestella reopened a pull request: https://github.com/apache/metron/pull/879
METRON-1378: Create a summarizer

## Contributor Comments

We have a nice, generalized infrastructure for loading data into HBase and interacting with it via `flatfile_loader.sh` and `ENRICHMENT_GET()`. It is also useful to summarize a set of data into a static data structure, store it on HDFS, and interact with it via Stellar. To this end, to complement `flatfile_loader.sh`, we should have a `flatfile_summarizer.sh` that, using the same extractor config, will process a flat file and output a serialized object.

The use case for this is as follows. Let's say that I have a static list of domains in the second column of a CSV, `domains.csv`, and I want to generate a bloom filter containing those domains sans TLD. I should be able to create a file called `bloom.ser` with the serialized bloom filter given the extractor config:
```
{
  "config" : {
    "columns" : {
      "rank" : 0,
      "domain" : 1
    },
    "value_transform" : {
      "domain" : "DOMAIN_REMOVE_TLD(domain)"
    },
    "value_filter" : "LENGTH(domain) > 0",
    "state_init" : "BLOOM_INIT()",
    "state_update" : {
      "state" : "BLOOM_ADD(state, domain)"
    },
    "state_merge" : "BLOOM_MERGE(states)",
    "separator" : ","
  },
  "extractor" : "CSV"
}
```

Note, the associated Stellar function `OBJECT_GET` is available in #880.

# Testing Plan

We should run the test plan for #445 to ensure no regressions, since 80% of this PR is just refactoring existing abstractions for reuse.
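For reviewers who want a mental model of how `state_init`, `state_update`, and `state_merge` in the extractor config above fit together, here is a rough Python sketch of the fold. It is not the summarizer's implementation: the function names are hypothetical, a plain `set` stands in for the bloom filter, `remove_tld` is a naive stand-in for `DOMAIN_REMOVE_TLD` (it just drops the last dot-separated label), and the round-robin partitioning across threads is an assumption made for illustration.

```python
import csv
from concurrent.futures import ThreadPoolExecutor


def remove_tld(domain):
    # Naive stand-in for DOMAIN_REMOVE_TLD (assumption): drop the last label.
    head, _, _ = domain.rpartition(".")
    return head


def summarize_partition(lines):
    # state_init: a set stands in for BLOOM_INIT() in this sketch.
    state = set()
    for row in csv.reader(lines):
        if len(row) < 2:
            continue
        domain = remove_tld(row[1])   # value_transform
        if len(domain) > 0:           # value_filter
            state.add(domain)         # state_update
    return state


def summarize(lines, num_threads=2):
    # Hypothetical partitioning: hand lines to threads round-robin.
    parts = [lines[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        states = list(pool.map(summarize_partition, parts))
    # state_merge: combine the per-thread states into one object.
    merged = set()
    for state in states:
        merged |= state
    return merged
```

Each thread builds its own state, and only the merged result would be serialized, which is why the config needs a separate merge expression over `states`.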
## Write out a String Locally

We are going to take the top 10k Alexa domains (saved to `~/top-10k.csv` as part of #445's test plan) and:
* Keep a running sample of 20 domains per thread
* At the end, merge the samples and get a random domain from the merged samples
* Write out the sample

### Test

* Create a file `~/extractor_sample.json` with the following contents:
```
{
  "config" : {
    "columns" : {
      "rank" : 0,
      "domain" : 1
    },
    "value_transform" : {
      "domain" : "DOMAIN_REMOVE_TLD(domain)"
    },
    "value_filter" : "LENGTH(domain) > 0",
    "state_init" : "SAMPLE_INIT(20)",
    "state_update" : {
      "state" : "SAMPLE_ADD(state, domain)"
    },
    "state_merge" : "GET_FIRST(SAMPLE_GET(SAMPLE_MERGE(states, SAMPLE_INIT(1))))",
    "separator" : ","
  },
  "extractor" : "CSV"
}
```
* Summarize via `$METRON_HOME/bin/flatfile_summarizer.sh -i ~/top-10k.csv -o ~/sample.ser -e ~/extractor_sample.json -p 5 -b 128`
* Execute `hexdump -C ~/sample.ser` and ensure that there is a string in there. The string may be preceded or followed by some non-ASCII bytes, e.g.
```
[root@node1 ~]# hexdump -C ./sample.ser
00000000  03 01 37 63 66 6d 6e e6                           |..7cfmn.|
00000008
[root@node1 ~]# cat top-10k.csv | grep 7cfmn
4696,7cfmnf.top
```

### Typosquatting Use-case Testing

You can also follow the testing plan for #882, as this code is merged into that PR and it shows how this feature can be used in a real use case.

## Pull Request Checklist

Thank you for submitting a contribution to Apache Metron.
Please refer to our [Development Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235) for the complete guide to follow for contributions.
Please refer also to our [Build Verification Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview) for complete smoke testing guides.

In order to streamline the review of the contribution we ask that you follow these guidelines and double check the following:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
- [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target branch (typically master)?

### For code changes:
- [x] Have you included steps to reproduce the behavior or problem that is being changed or addressed?
- [x] Have you included steps or a guide to how the change may be verified and tested manually?
- [x] Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
  ```
  mvn -q clean integration-test install && build_utils/verify_licenses.sh
  ```
- [x] Have you written or updated unit tests and/or integration tests to verify your changes?
- [x] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [x] Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

### For documentation related changes:
- [x] Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and then verify changes via `site-book/target/site/index.html`:
  ```
  cd site-book
  mvn site
  ```

#### Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that [travis-ci](https://travis-ci.org) is set up for your personal repository such that your branches are built there before submitting a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cestella/incubator-metron flatfile_object_gen

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metron/pull/879.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #879

----
commit 9c492c4540534fa72550aff330ce6c588f640965
Author: cstella <cestella@...>
Date:   2017-12-21T15:17:18Z

    flatfile summarizer initial commit.

commit 15681143e86913a692777770d0a89e1c877e3d99
Author: cstella <cestella@...>
Date:   2017-12-21T18:50:58Z

    typo

commit 935d4d2933e7156219722e54cec5dfce228fdbcc
Author: cstella <cestella@...>
Date:   2017-12-21T21:17:23Z

    Updating tests and docs.

commit afe91c341608468e2637db4a02f9428ebe19353a
Author: cstella <cestella@...>
Date:   2017-12-21T21:18:20Z

    more docs.

commit d955e26cf4e7776642e83b23deb305fd5a238cc2
Author: cstella <cestella@...>
Date:   2017-12-21T21:46:30Z

    Renamed test.

commit ac3c612cd6fd7140a14fac9692000f04b65ecc83
Author: cstella <cestella@...>
Date:   2017-12-22T12:23:04Z

    Adding a ToString writer.

commit 34cdb55f6c43049151c5b5242a73a09119de31ef
Author: cstella <cestella@...>
Date:   2017-12-22T15:10:15Z

    Renamed to console writer

commit b3e4408ab98d69866774bae452e9cc47efc4fbdd
Author: cstella <cestella@...>
Date:   2017-12-22T15:14:43Z

    newline issue.

commit 767e4976a723451c92ff7bbceffafd5c38086c19
Author: cstella <cestella@...>
Date:   2017-12-23T15:32:07Z

    Allowing empty outputs

commit b4e40a4e47ddc6ff871ef0e95b433fb4315f8e34
Author: cstella <cestella@...>
Date:   2017-12-23T16:07:10Z

    Missed a compilation error.

commit 3ed05682372b10aa544f7fbba8a93d7dca78ca25
Author: cstella <cestella@...>
Date:   2018-01-08T14:32:34Z

    Merge branch 'master' into flatfile_object_gen

----