mmiklavc commented on issue #1525: METRON-2274 Flatfile loader and summarizer mapreduce mode broken
URL: https://github.com/apache/metron/pull/1525#issuecomment-538150156

## Test Plan

Taken from:
1. Flatfile loader - https://github.com/apache/metron/pull/432#issuecomment-276733075
2. Flatfile summarizer - https://github.com/apache/metron/tree/master/use-cases/typosquat_detection#summarize

### Preliminaries

* Spin up the dev environment for CentOS 6 or 7
* Running as root is fine
* The root user needs a home directory in HDFS. You can create it as follows:
```
sudo -u hdfs hdfs dfs -mkdir /user/root
sudo -u hdfs hdfs dfs -chown root:root /user/root
```
* Download the Alexa top 1m data set
```
wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
unzip top-1m.csv.zip
```
* Stage the import file
```
head -n 10000 top-1m.csv > top-10k.csv
hdfs dfs -put top-10k.csv /tmp
```
* Truncate the HBase enrichment table
```
echo "truncate 'enrichment'" | hbase shell
```

### Test the flatfile loader in MR mode

* Create an extractor config for the CSV data by editing `extractor.json` and pasting in these contents:
```
{
  "config" : {
    "columns" : {
      "domain" : 1,
      "rank" : 0
    }
    ,"indicator_column" : "domain"
    ,"type" : "alexa"
    ,"separator" : ","
  },
  "extractor" : "CSV"
}
```
* Import from HDFS via MR
```
# import data into hbase
$METRON_HOME/bin/flatfile_loader.sh -i /tmp/top-10k.csv -t enrichment -c t -e ./extractor.json -m MR
# count data written and verify it's 10k
echo "count 'enrichment'" | hbase shell
```

### Test the flatfile summarizer in MR mode

* Create an `extractor_count.json` file in your home directory and paste the following:
```
{
  "config" : {
    "columns" : {
      "rank" : 0,
      "domain" : 1
    },
    "value_transform" : {
      "domain" : "DOMAIN_REMOVE_TLD(domain)"
    },
    "value_filter" : "LENGTH(domain) > 0",
    "state_init" : "0L",
    "state_update" : {
      "state" : "state + LENGTH( DOMAIN_TYPOSQUAT( domain ))"
    },
    "state_merge" : "REDUCE(states, (s, x) -> s + x, 0)",
    "separator" : ","
  },
  "extractor" : "CSV"
}
```
* Create the summary from HDFS via MR
```
$METRON_HOME/bin/flatfile_summarizer.sh -i /tmp/top-10k.csv -e ~/extractor_count.json -p 5 -om CONSOLE -m MR
```
* Verify you see a count in the output similar to the following:
```
Processing /root/top-10k.csv
19/10/03 21:19:56 WARN resolver.BaseFunctionResolver: Using System classloader
Processed 9999 - \
3478276
```
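
For anyone reviewing the summarizer config, here is a rough Python sketch of how I read the `state_init` / `state_update` / `state_merge` fields: each partition starts from the initial state, folds its records into it, and the per-partition states are then reduced into one number. This is not Metron code. `remove_tld` and `typosquat_candidates` are hypothetical, heavily simplified stand-ins for the Stellar functions `DOMAIN_REMOVE_TLD` and `DOMAIN_TYPOSQUAT`, and the round-robin partitioning is an assumption made only for illustration, so the totals will not match the real summarizer output.

```python
import csv
import io
from functools import reduce

COLUMNS = {"rank": 0, "domain": 1}  # same column mapping as the extractor config


def remove_tld(domain):
    # Simplified stand-in for DOMAIN_REMOVE_TLD: drops only the last label.
    return domain.rsplit(".", 1)[0] if "." in domain else domain


def typosquat_candidates(domain):
    # Hypothetical stand-in for DOMAIN_TYPOSQUAT: single-character deletions only.
    return {domain[:i] + domain[i + 1:] for i in range(len(domain))} - {""}


def summarize(lines, partitions):
    rows = list(csv.reader(lines))
    states = []
    for p in range(partitions):
        state = 0  # state_init
        for row in rows[p::partitions]:  # assumed round-robin split
            domain = remove_tld(row[COLUMNS["domain"]])  # value_transform
            if len(domain) > 0:  # value_filter
                state = state + len(typosquat_candidates(domain))  # state_update
        states.append(state)
    # state_merge: REDUCE(states, (s, x) -> s + x, 0)
    return reduce(lambda s, x: s + x, states, 0)


sample = io.StringIO("1,google.com\n2,youtube.com\n3,facebook.com\n")
total = summarize(sample, partitions=2)
print(total)
```

Because the merge step is a plain sum, the result should not depend on the number of partitions, which is why the `-p 5` value above should only affect parallelism and not the final count.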
