[ 
https://issues.apache.org/jira/browse/METRON-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857114#comment-15857114
 ] 

ASF GitHub Bot commented on METRON-706:
---------------------------------------

GitHub user mmiklavc opened a pull request:

    https://github.com/apache/incubator-metron/pull/445

    METRON-706: Add Stellar transformations and filters to enrichment and threat intel loaders

    This PR completes work in https://issues.apache.org/jira/browse/METRON-706
    
    (Note: this branch includes commits from @cestella that I merged while working on this. They are already squashed in master, so they appear here only in the commit history, not in the diff.)
    
    The motivation for this PR is to expose Stellar capabilities in more places. This work enables transformations and filtering in the enrichment and threatintel extractors. The user can now specify transformation expressions on the column values and, separately, filter records based on a provided predicate. The same can be done independently for the key indicator value used as part of the HBase key. In addition, a new configuration property lets the user specify a Zookeeper quorum and reference properties defined in the global config.
    
    See the updated README for documentation details on the new properties.
    
    **Testing**
    
    Testing closely follows the method described in [#432](https://github.com/apache/incubator-metron/pull/432#issuecomment-276733075)
    
    * Download the Alexa top 1m data set
    ```
    wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
    unzip top-1m.csv.zip
    ```
    
    * Stage import file
    ```
    head -n 10000 top-1m.csv > top-10k.csv
    head -n 10 top-1m.csv > top-10.csv
    ```
    
    * Create an extractor.json for the CSV data by pasting in the contents below. (Set zk_quorum to your own value if it differs from the default Vagrant quick-dev environment.)
    ```
    {
      "config" : {
        "zk_quorum" : "node1:2181",
        "columns" : {
           "rank" : 0,
           "domain" : 1
        },
        "value_transform" : {
           "domain" : "DOMAIN_REMOVE_TLD(domain)",
           "port" : "es.port"
        },
        "value_filter" : "LENGTH(domain) > 0",
        "indicator_column" : "domain",
        "indicator_transform" : {
           "indicator" : "DOMAIN_REMOVE_TLD(indicator)"
        },
        "indicator_filter" : "LENGTH(indicator) > 0",
        "type" : "top_domains",
        "separator" : ","
      },
      "extractor" : "CSV"
    }
    ```
    
    The "port" property/variable here references "es.port" from the global config.
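
As a rough illustration of what the configuration above asks the loader to do, the following Python sketch mimics the transform and filter steps on a single CSV row. It is not Metron code: `remove_tld` is a naive stand-in for Stellar's `DOMAIN_REMOVE_TLD` (it only drops the last dot-separated label), `GLOBALS` is a hypothetical stand-in for the global config read from Zookeeper, and the order of the steps (value transforms, value filter, indicator transform, indicator filter) is an assumption.

```python
# Hypothetical stand-in for the global config stored in Zookeeper.
GLOBALS = {"es.port": "9300"}


def remove_tld(domain):
    """Naive stand-in for DOMAIN_REMOVE_TLD: drop everything after the last dot."""
    return domain.rsplit(".", 1)[0] if "." in domain else domain


def process_row(line, separator=","):
    """Return (indicator, value) for one CSV line, or None if it is filtered out."""
    fields = line.strip().split(separator)
    # "columns": map column names to positions.
    value = {"rank": fields[0], "domain": fields[1]}
    # "indicator_column": the raw indicator is taken from the untransformed column
    # (an assumption of this sketch).
    indicator = value["domain"]

    # "value_transform": rewrite or add value fields.
    value["domain"] = remove_tld(value["domain"])
    value["port"] = GLOBALS["es.port"]
    # "value_filter": LENGTH(domain) > 0
    if len(value["domain"]) == 0:
        return None

    # "indicator_transform" followed by "indicator_filter".
    indicator = remove_tld(indicator)
    if len(indicator) == 0:
        return None
    return indicator, value


print(process_row("5,yahoo.com"))
```

For the row `5,yahoo.com` this yields the indicator `yahoo` and a value containing `rank`, `domain` and `port`, in line with the yahoo row in the scan output further down.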
    
    * Run the import (parallelism of 5, batch size of 128)
    ```
    echo "truncate 'enrichment'" | hbase shell && /usr/metron/0.3.0/bin/flatfile_loader.sh -i ./top-10k.csv -t enrichment -c t -e ./extractor.json -p 5 -b 128 && echo "count 'enrichment'" | hbase shell
    ```
    
    You should see 9275 records in HBase (fewer than the 10k you might expect).
    
    * Now run it again on the top-10 set.
    ```
    echo "truncate 'enrichment'" | hbase shell && /usr/metron/0.3.0/bin/flatfile_loader.sh -i ./top-10.csv -t enrichment -c t -e ./extractor.json -p 5 -b 128 && echo "count 'enrichment'" | hbase shell
    ```
    
    You should get 9 rows, as shown below:
    ```
    scan 'enrichment'
    ROW  COLUMN+CELL
     \x09\x00\x0F,\x10\xE5\xD1\xDE_\xBF\x9E\xA7d\xF2\xA8\x94\x00\x0Btop_domains\x00\x05yahoo  column=t:v, timestamp=1486513090953, value={"port":"9300","domain":"yahoo","rank":"5"}
     \x11\xCA\xCF\x01\xB4\xC5\x11@\x0C\xA1A,\xE9j~O\x00\x0Btop_domains\x00\x05tmall  column=t:v, timestamp=1486513090979, value={"port":"9300","domain":"tmall","rank":"10"}
     \x13)`\xFC\xF2\xBF\xF9\xC1a\xC8a\xF1h\x0E\xB5\x11\x00\x0Btop_domains\x00\x07youtube  column=t:v, timestamp=1486513090930, value={"port":"9300","domain":"youtube","rank":"2"}
     1\xC2I\x05k\xEA\x0EY\xE1\xAD\xA0$U\xA9kc\x00\x0Btop_domains\x00\x06google  column=t:v, timestamp=1486513090964, value={"port":"9300","domain":"google","rank":"7"}
     =\xDD\xDFH\x95\xC0\xB9\xD9\xBAKX\x8B\x9B2T\x9F\x00\x0Btop_domains\x00\x08facebook  column=t:v, timestamp=1486513090942, value={"port":"9300","domain":"facebook","rank":"3"}
     D\xDE\x1C\x9A\xCF\x07S\x9A\xDEB\xDB\x87D\x1F\x1D\xF4\x00\x0Btop_domains\x00\x02qq  column=t:v, timestamp=1486513090974, value={"port":"9300","domain":"qq","rank":"9"}
     u\xBC\xFC\xC9\x09\x9Af\xE1\xC8\xA5\x9A\x93\xCB0c\x01\x00\x0Btop_domains\x00\x06amazon  column=t:v, timestamp=1486513090970, value={"port":"9300","domain":"amazon","rank":"8"}
     \xC7\xA5.l\xC21\xFAQ8\x1E\x5C\x99p\x93_\x9A\x00\x0Btop_domains\x00\x09wikipedia  column=t:v, timestamp=1486513090958, value={"port":"9300","domain":"wikipedia","rank":"6"}
     \xCC\xCA\xBF;\x92\xA1\xA0k\xE4\x83i\xBD\xC3\xA8\xE8p\x00\x0Btop_domains\x00\x05baidu  column=t:v, timestamp=1486513090948, value={"port":"9300","domain":"baidu","rank":"4"}
    ```
    
    Once again, we get fewer rows than the input file contains. This is because multiple records map to the same resulting key in HBase.
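
The key collision can be reproduced outside of HBase. The sketch below is a hedged illustration in Python, not part of the loader: it drops the last dot-separated label from a few hypothetical domains (a naive approximation of `DOMAIN_REMOVE_TLD`, which will not match the real function on every input) and groups the inputs by the resulting key.

```python
from collections import defaultdict


def naive_remove_tld(domain):
    """Rough approximation of DOMAIN_REMOVE_TLD: drop the last dot-separated label."""
    return domain.rsplit(".", 1)[0]


def group_by_key(domains):
    """Map each transformed key to the list of original domains that produce it."""
    groups = defaultdict(list)
    for domain in domains:
        groups[naive_remove_tld(domain)].append(domain)
    return dict(groups)


# Hypothetical input rows; not taken from the Alexa data set.
sample = ["example.com", "example.org", "sample.com", "example.net"]
groups = group_by_key(sample)
print(len(sample), "input rows ->", len(groups), "distinct keys")
for key, sources in groups.items():
    print(key, "<-", sources)
```

Four input rows collapse to two distinct keys here. Since a later put to the same row key and column supersedes the earlier one in a default scan, only one record per key remains visible.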

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mmiklavc/incubator-metron top-domains-merge

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-metron/pull/445.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #445
    
----
commit 64a2fc6ee1190776bcbb46ecf6841b58ce2bf311
Author: Michael Miklavcic <[email protected]>
Date:   2017-01-25T21:38:08Z

    save some work and notes

commit a6a6ab64e2777610ff57727195d3ce0d2c2c8cb1
Author: Michael Miklavcic <[email protected]>
Date:   2017-01-27T14:25:54Z

    Extraction done

commit 47d814ef95d67738d20ce5dc530ba7b05d418a96
Author: cstella <[email protected]>
Date:   2017-01-27T23:15:44Z

    Multithreading the SimpleEnrichmentFlatFileLoader

commit 918d4ce4aea5d7dfde992f32bf049c70f35dd182
Author: cstella <[email protected]>
Date:   2017-01-27T23:23:19Z

    doc changes.

commit c6ca3a86881eb77bc9598a61e3c0cf8280ccb03f
Author: cstella <[email protected]>
Date:   2017-01-27T23:39:56Z

    Updating docs.

commit 8c9a79cdfa38ea2fbd161095d5e346147558ec5f
Author: cstella <[email protected]>
Date:   2017-01-28T03:36:31Z

    Investigating integration tests.

commit 315bd181aa634290ab987441d81c28addb7952e2
Author: cstella <[email protected]>
Date:   2017-01-28T04:09:28Z

    Update integration test to be a proper integration test.

commit 004c6f41b6c1cc3ecea70513e1a468501bd32e3c
Author: cstella <[email protected]>
Date:   2017-01-28T04:49:37Z

    Adding spliterator unit test for completeness

commit f8dd48ef920c948e1fc5ff736e386f641e551b2b
Author: cstella <[email protected]>
Date:   2017-01-28T05:01:42Z

    Updating test to use a proper file

commit 9b04f9723d442c8f4fb7a8bcaa1d733fc1305dc4
Author: cstella <[email protected]>
Date:   2017-01-28T05:17:12Z

    Updating docs and renaming a few things.

commit eb5b82cc35bd767a169f548ea8144dd9ae165f84
Author: cstella <[email protected]>
Date:   2017-01-28T05:23:25Z

    Update one more test case.

commit 81c42afa2ff619ca23bfa5ec546c94ee8d6063e5
Author: Michael Miklavcic <[email protected]>
Date:   2017-01-30T16:09:52Z

    partial commit - adding additional filter and transform for indicator

commit 310c98bd946b2fdb320193cce85d368f016bf8c3
Author: cstella <[email protected]>
Date:   2017-01-30T20:36:23Z

    Merge branch 'master' into unified_loader

commit 3f6e3ba4f30e41c94ff25027f1fd7c839ea6c9bf
Author: cstella <[email protected]>
Date:   2017-01-31T15:39:03Z

    Updating simple enrichment flat file loader to be complete.

commit 2bdaf419621704970159e75e202acfeb868c3571
Author: Michael Miklavcic <[email protected]>
Date:   2017-01-31T20:16:10Z

    Merge branch 'master' into top-domains

commit 79cfdb4fba5e82e9e170bfc77c7133e6646f9787
Author: cstella <[email protected]>
Date:   2017-01-31T22:12:05Z

    Removing old threatintel_bulk_load.sh script and integrating into the flatfile load script

commit bf7756b52e66907ca23a576ba9be9ab40b33f77d
Author: cstella <[email protected]>
Date:   2017-01-31T22:22:17Z

    Forgot licenses.

commit e5729a296bdbef6d2d3ee87c69aade396708f47d
Author: Michael Miklavcic <[email protected]>
Date:   2017-02-01T00:16:06Z

    Merge with master. Get indicator transforms and filter working

commit a104f464e6b882121c7ab44079a5570d282c8457
Author: cstella <[email protected]>
Date:   2017-02-01T00:28:46Z

    updating script.

commit b121e13d892834865847ddd806cbf10da63fa44e
Author: cstella <[email protected]>
Date:   2017-02-01T00:34:28Z

    Merge branch 'master' into unified_loader

commit b5a9e5a9243576b27d59e959dfab3e99d34eb761
Author: cstella <[email protected]>
Date:   2017-02-01T00:57:02Z

    Added gzip and zip to regular files

commit 323267ddfb52ab1aa7488e02643a8158044797e2
Author: cstella <[email protected]>
Date:   2017-02-01T15:04:53Z

    Fixed stupid zip issue.

commit bc26b5b3992b91097bb4fc4b214d4b6bacaddfbb
Author: cstella <[email protected]>
Date:   2017-02-01T16:27:58Z

    Updating readme and making progress bar optional and better.

commit 6cdf35d94f72be7da524fd5f854876f131ddb9f9
Author: cstella <[email protected]>
Date:   2017-02-01T17:39:59Z

    updating tests to include gzip and zip

commit fd718bffa5e97f2c5c510b38d6a6d3812aefbed9
Author: Michael Miklavcic <[email protected]>
Date:   2017-02-01T18:57:04Z

    Refactor

commit d24f0c974d27e3861cb431c48efb3380a372e58b
Author: Michael Miklavcic <[email protected]>
Date:   2017-02-02T19:03:56Z

    Get unit test for extractor decorator working

commit d9bb54ec27a0f3282d28ba40d043f0045c167a54
Author: Michael Miklavcic <[email protected]>
Date:   2017-02-02T21:47:08Z

    Add negative test cases. Refactor options as enum in extractor decorator

commit 43c09c810c7d7cfa05cffa4609edab7ba2f24492
Author: Michael Miklavcic <[email protected]>
Date:   2017-02-03T18:10:34Z

    Intermediate commit - need to fetch from PR432

commit eafc786250d9b8e6283bd71c91bbd270ba4d1311
Author: Michael Miklavcic <[email protected]>
Date:   2017-02-03T18:52:03Z

    Get integration tests for flat file loader working with my branch. Fix trampled commit for ExtractorHandler

commit ad1aef760948109565b7144479151312ebccc24d
Author: Michael Miklavcic <[email protected]>
Date:   2017-02-03T19:46:05Z

    Get integration tests working for Stellar transformations in the file loader

----


> Add Stellar transformations and filters to enrichment and threat intel loaders
> ------------------------------------------------------------------------------
>
>                 Key: METRON-706
>                 URL: https://issues.apache.org/jira/browse/METRON-706
>             Project: Metron
>          Issue Type: Improvement
>            Reporter: Michael Miklavcic
>            Assignee: Michael Miklavcic
>
> This Jira tracks work to add the ability to transform and filter data being 
> loaded into the enrichment and threatintel HBase tables.
> This effort builds on the work in:
> https://issues.apache.org/jira/browse/METRON-678
> and
> https://issues.apache.org/jira/browse/METRON-682



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
