mmiklavc commented on issue #1525: METRON-2274 Flatfile loader and summarizer 
mapreduce mode broken
URL: https://github.com/apache/metron/pull/1525#issuecomment-538150156
 
 
   ## Test Plan
   
   Taken from:
   1. Flatfile loader - 
https://github.com/apache/metron/pull/432#issuecomment-276733075
   2. Flatfile summarizer - 
https://github.com/apache/metron/tree/master/use-cases/typosquat_detection#summarize
   
   ### Preliminaries
   
   * Spin up the dev environment for Centos 6 or 7
   * Run as root is fine
   * Root user needs a home dir in HDFS. You can do that as follows:
   ```
   sudo -u hdfs hdfs dfs -mkdir /user/root
   sudo -u hdfs hdfs dfs -chown root:root /user/root
   ```
   * Download the Alexa top 1m data set
   ```
   wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
   unzip top-1m.csv.zip
   ```
   
   * Stage import file
   ```
   head -n 10000 top-1m.csv > top-10k.csv
   hdfs dfs -put top-10k.csv /tmp
   ```
   
   * Truncate hbase
   ```
   echo "truncate 'enrichment'" | hbase shell
   ```
   
   ### Test the flatfile loader in MR mode
   
   * Create an extractor.json for the CSV data by editing `extractor.json` and 
pasting in these contents:
   ```
   {
     "config" : {
       "columns" : {
          "domain" : 1,
          "rank" : 0
                   }
       ,"indicator_column" : "domain"
       ,"type" : "alexa"
       ,"separator" : ","
                },
     "extractor" : "CSV"
   }
   ```
   
   * Import from HDFS via MR
   ```
   # import data into hbase 
   $METRON_HOME/bin/flatfile_loader.sh -i /tmp/top-10k.csv -t enrichment -c t 
-e ./extractor.json -m MR
   # count data written and verify it's 10k
   echo "count 'enrichment'" | hbase shell
   ```
   
   ### Test the flatfile summarizer in MR mode
   
   * Create an extractor-count.json file and paste the following:
   ```
   {
     "config" : {
       "columns" : {
          "rank" : 0,
          "domain" : 1
       },
       "value_transform" : {
          "domain" : "DOMAIN_REMOVE_TLD(domain)"
       },
       "value_filter" : "LENGTH(domain) > 0",
       "state_init" : "0L",
       "state_update" : {
          "state" : "state + LENGTH( DOMAIN_TYPOSQUAT( domain ))"
                        },
       "state_merge" : "REDUCE(states, (s, x) -> s + x, 0)",
       "separator" : ","
     },
     "extractor" : "CSV"
   }
   ```
   
   * Create the summary from HDFS via MR
   ```
   $METRON_HOME/bin/flatfile_summarizer.sh -i /tmp/top-10k.csv -e 
~/extractor_count.json -p 5 -om CONSOLE -m MR
   ```
   * Verify you see a count in the output similar to the following:
   ```
   Processing /root/top-10k.csv
   19/10/03 21:19:56 WARN resolver.BaseFunctionResolver: Using System 
classloader
   Processed 9999 - \
   3478276
   ```
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to