[jira] [Commented] (METRON-283) Migrate Geo Enrichment outside of MySQL

ASF GitHub Bot (JIRA) Fri, 20 Jan 2017 05:31:27 -0800

    [ 
https://issues.apache.org/jira/browse/METRON-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15831741#comment-15831741
 ]


ASF GitHub Bot commented on METRON-283:
---------------------------------------

GitHub user justinleet opened a pull request:

    https://github.com/apache/incubator-metron/pull/421

    METRON-283 Migrate Geo Enrichment outside of MySQL

    ## MySQL Removed
    Drops MySQL entirely from the project.  This is done for several reasons 
outlined in a discussion thread on the dev list.  They boil down to licensing, 
eliminating a single point of failure, being able to use MaxMind's libraries 
for handling GeoLite2 data, and a couple of other concerns.
    
    The removal covers dependencies, installation paths, READMEs, etc.  The 
only pieces left are MySQLConfig and MySQLConfigTest, in case anyone wants to 
use them.  The vast majority of removed files and code come simply from 
stripping MySQL out.  If any other traces of MySQL are found in review, they 
should almost certainly be removed.
    
    This moves to a system based on using [MaxMind's binary 
DB](http://dev.maxmind.com/geoip/geoip2/geolite2/) directly. Per their page:
    
    > The GeoLite2 databases are distributed under the Creative Commons 
Attribution-ShareAlike 4.0 International License. 
    
    Our LICENSE file has been updated with the notification that we use their 
database (and some of their [test 
data](https://github.com/maxmind/MaxMind-DB), which is CC 
Attribution-ShareAlike 3.0).  Both of these licenses are acceptable for us as 
stated on [Apache Legal](https://www.apache.org/legal/resolved#cc-sa).
    
    Let me know if that notification should be in a different spot or spots, 
and I can adjust it appropriately.
    
    ## GeoLite2 Database
    The main portion of the PR is in `GeoLiteDatabase.java`, which manages 
access to the GeoLite2 database.
    
    ### Raw Database
    The raw database is stored on HDFS.  By default, it will be in 
`/apps/metron/geo`.  If no explicit location is given, 
`/apps/metron/geo/default/<db_filename>` will be used.  Otherwise, updates will 
use `/apps/metron/geo/<millis>/<db_filename>`.  Given the low rate of churn on 
the DB (updated once per week) and the potential for replay use cases, I 
haven't implemented any pruning or anything fancy on top of this.
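
    The directory layout described above can be sketched as a small path 
helper.  This is an illustrative Python sketch, not code from the PR: 
`geo_db_path`, its parameters, and `now_millis` are hypothetical names, and it 
assumes that `<millis>` is the update time in milliseconds since the epoch.

```python
import posixpath
import time

# Default base directory named in the description above.
GEO_BASE = "/apps/metron/geo"


def geo_db_path(db_filename, update_millis=None, base=GEO_BASE):
    """Return the HDFS path for the database file.

    With no update timestamp the 'default' directory is used; otherwise the
    file goes under a directory named for the update time in milliseconds.
    """
    subdir = "default" if update_millis is None else str(update_millis)
    return posixpath.join(base, subdir, db_filename)


def now_millis():
    """Current time in milliseconds, usable as the <millis> path segment."""
    return int(time.time() * 1000)
```

    Because every update gets its own directory, older copies stay in place, 
which is what makes the replay use case possible without any pruning logic.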
    
    #### Updating DB
    A script for updates, `geo_enrichment_load.sh`, is provided in 
`metron-data-management`.  Usage details are in 
`metron-data-management/README.md`.  Note that the original MySQL-based setup 
didn't appear to have update capabilities.
    
    The script pulls down a new copy of the GeoLite2 database.  The source 
can be MaxMind's standard web address, another hosted location, or even a 
file:// URL.  Once the DB file is downloaded, the script pushes it to the 
appropriate HDFS location.  Finally, it pulls down the global config and 
updates it with the new location.  This does not require a topology restart.
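
    The three steps (fetch, push to HDFS, update the global config) can be 
sketched as follows.  This is a rough Python illustration and not the contents 
of `geo_enrichment_load.sh`: the function names are hypothetical, the config 
key `geo.hdfs.file` is an assumption, and the HDFS step only builds an 
`hdfs dfs -put` command rather than running it.

```python
import json
import shutil
import urllib.parse
import urllib.request
from pathlib import Path

GEO_CONFIG_KEY = "geo.hdfs.file"  # assumed name of the global-config key


def fetch_db(url, dest_dir):
    """Download the DB from an http(s):// or file:// URL into dest_dir."""
    name = Path(urllib.parse.urlparse(url).path).name
    dest = Path(dest_dir) / name
    with urllib.request.urlopen(url) as src, open(dest, "wb") as out:
        shutil.copyfileobj(src, out)
    return dest


def hdfs_put_command(local_file, hdfs_dir):
    """Build (but do not run) the command that pushes the file to HDFS."""
    return ["hdfs", "dfs", "-put", "-f", str(local_file), hdfs_dir]


def with_new_location(global_json, hdfs_path):
    """Return the global config JSON with the DB location replaced."""
    config = json.loads(global_json)
    config[GEO_CONFIG_KEY] = hdfs_path
    return json.dumps(config, indent=2, sort_keys=True)
```

    Keeping the config change as the last step means a running topology only 
sees the new location after the file is already in place on HDFS.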
    
    Note that there have been conversations about how we manage config updates 
(specifically leaning towards Ambari).  This has not been finalized, and we 
have two non-Ambari testing environments (quickdev and docker-metron), so the 
script just hits ZooKeeper directly.  Ambari is not updated by this script, 
and it remains the user's responsibility to update global.json.
    
    This leads to a question people may have preferences on:
    
    - Do we want the script to always update? Should there be a flag to stage 
the file, but not update configs?
    
    ### Code
    The database class is a singleton that allows the database to be swapped 
out when the global config is updated.  It is (hopefully!) correctly locked to 
avoid threading issues when updating or reading from the DB (and I've been 
able to update without issues).
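
    The singleton-plus-locking idea can be sketched as below.  This is a 
simplified Python illustration, not the Java class from the PR: the class and 
method names are hypothetical, the config key `geo.hdfs.file` is an 
assumption, a single mutex stands in for whatever locking the real code uses, 
and `loader` is a stand-in for opening the binary DB.

```python
import threading

GEO_CONFIG_KEY = "geo.hdfs.file"  # assumed name of the global-config key


class GeoDatabase:
    """Process-wide holder for the currently loaded geo database."""

    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self, loader):
        self._loader = loader          # callable: path -> dict-like reader
        self._lock = threading.Lock()  # guards _path and _reader together
        self._path = None
        self._reader = None

    @classmethod
    def instance(cls, loader):
        """Return the single shared instance, creating it on first use."""
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls(loader)
            return cls._instance

    def update_if_necessary(self, global_config):
        """Reload the DB only when the configured location has changed."""
        new_path = global_config.get(GEO_CONFIG_KEY)
        with self._lock:
            if new_path is None or new_path == self._path:
                return False
            self._reader = self._loader(new_path)
            self._path = new_path
            return True

    def get(self, ip):
        """Look up an IP; returns None if nothing is loaded or no match."""
        with self._lock:
            if self._reader is None:
                return None
            return self._reader.get(ip)
```

    Loading while holding the lock keeps the sketch simple but blocks readers 
for the duration of the load; loading first and swapping the reference under 
the lock would shorten that window.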
    
    The various Bolts have been updated to make sure they initialize the 
adapter to have it grab the current data appropriately.
    
    In addition, a Stellar function, `GEO_GET()`, has been provided; it takes 
an IPv4 address.  It probably works with an IPv6 address, but I didn't really 
dig into it, given that the initial goal was parity with the existing behavior.
    
    Given the somewhat core nature of this, and my relative unfamiliarity going 
in with how all these pieces tie together, I'm definitely looking for feedback 
on how things are implemented, or if I missed conventions we've used in the 
code.
    
    ## Testing
    Unit testing is added for the database and Stellar portions of the code as 
needed.  The DB testing uses one of MaxMind's published test DBs, because we 
can't create the binary format correctly ourselves.  It does not use the full 
(20+ MB) version of the data, but rather a stripped-down version (on the order 
of several KB).
    
    Three environments were tested for this PR (listed below).  Having three 
disparate environments makes cross-cutting features like this one more 
complicated to test, so additional scrutiny is merited (I would definitely 
like at least one person to run through one of these themselves and make sure 
it's transparent).  Notably, quickdev requires the Ansible setup scripts to 
align; the mpack requires layout, internal configuration, and handling of 
additional files and ownership of scripts to work properly; and docker-metron 
requires essentially cheating the scripts and just running a wget on the file 
because things aren't actually set up.
    
    - quickdev
    - Ambari Management Pack
    - docker-metron
    
    ### QuickDev
    Ansible scripts are updated.  Ran data through the topologies and got 
data back out.
    
    ### Ambari Management Pack
    RPMs updated where needed. Config Screen layout changed, updates made to 
properly handle configs and ownership.  Ran Stellar on this install.  Again, 
ran data through the topology.
    
    ### Docker
    Essentially this just involved cheating the scripts and running a wget on 
the GeoLite2 DB file, because there's no Hadoop.  Ran through the instructions 
to run the topologies (which are a little different from the others because of 
Docker) and again was able to get data back out.
    
    ## Additional Notes
    - As noted above, do we want the DB script to always update? Should there 
be a flag to stage the file, but not update configs?  I primarily see this 
affecting the mpack because of the Ambari management behind it.
    - Increased `withMaxTimeMS` in the indexing integration test.  This seems 
unrelated to my changes (and I believe it has been seen elsewhere), so if 
anybody has found the root cause, I can adjust my code appropriately.
    - LocID doesn't technically exist in the new data, and I suspect it was 
never meant to be relied upon anyway outside of being a join key.  The same 
applies to the new field that is replacing it in this context.  It seems like 
we were mostly just passing that field along because it was available, and it 
seems like it should be refactored to be more useful.  I didn't take on that 
analysis here; this is just a slightly more validated version of a gut feeling.
    - The newer form of the MaxMind info has more data available than the old 
source we were using.  We should also consider passing (at least some of) this 
data along. See MaxMind's [What's New in 
GeoIP2](http://dev.maxmind.com/geoip/geoip2/whats-new-in-geoip2/).  One field 
that leapt out at me as potentially interesting contains where an IP was 
registered, rather than just where the IP actually is.  Another is the set of 
flags such as the `is_anonymous_proxy` key.  I didn't validate whether 
everything new is in the free version of the dataset.
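
    As a rough illustration of passing some of the newer fields along, the 
Python sketch below pulls a few optional values out of a decoded record.  It 
is not code from the PR: `flatten_geo_record` is a hypothetical helper, the 
nested key names (`country`, `registered_country`, `traits`) are assumed from 
MaxMind's GeoIP2 record layout and should be checked against the actual 
database, and the sample values are made up.

```python
def flatten_geo_record(record):
    """Pull a few optional fields out of a decoded GeoIP2-style record."""

    def dig(*keys):
        # Walk nested dicts, returning None as soon as a key is missing.
        node = record
        for key in keys:
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        return node

    fields = {
        "country": dig("country", "iso_code"),
        "registered_country": dig("registered_country", "iso_code"),
        "is_anonymous_proxy": dig("traits", "is_anonymous_proxy"),
    }
    # Drop fields the record does not carry so absent data is not emitted.
    return {name: value for name, value in fields.items() if value is not None}


sample = {  # made-up values for illustration only
    "country": {"iso_code": "US"},
    "registered_country": {"iso_code": "DE"},
    "traits": {"is_anonymous_proxy": True},
}
```

    Treating every new field as optional matters here because it is not yet 
confirmed which of them are present in the free dataset.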


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/justinleet/incubator-metron geo_mmdb

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-metron/pull/421.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #421
    
----
commit fe5e6b87e18ed08f6136bb59feb525ba50b978cd
Author: justinjleet <[email protected]>
Date:   2016-12-22T17:25:32Z

    Drop MySQL and use the GeoLite2 databases instead

commit b907d26f04fa5d8f4af50ecd1b8850aab6241e77
Author: justinjleet <[email protected]>
Date:   2017-01-18T13:04:37Z

    Updating unit test

commit 1b434ba8942e56f43ac05027c9bdd920ce135672
Author: justinjleet <[email protected]>
Date:   2017-01-19T04:00:38Z

    Update with MaxMind license

commit dd14dbf397458f26b62ec6638c0a39d06756b15c
Author: justinjleet <[email protected]>
Date:   2017-01-19T06:57:41Z

    geo url fix

commit 09b33b75a3690269aa88609a96f1ef7d7ea6112f
Author: justinjleet <[email protected]>
Date:   2017-01-19T07:07:17Z

    fixing docker

commit c9cbb23ebaef0a03811d789af52448662db95cb0
Author: justinjleet <[email protected]>
Date:   2017-01-19T07:08:39Z

    fixing Ansible after adjusting default path

commit 859106a7fffbbb47c86ea2d55b2a3600cc4b595d
Author: justinjleet <[email protected]>
Date:   2017-01-19T13:58:45Z

    Fixing metron-docker

commit b6bfc16cf2776272491914f0da55ea19a74c4006
Author: justinjleet <[email protected]>
Date:   2017-01-19T14:05:47Z

    Update docs

commit 6760c3e95cbefd97098b39ac385edfbf36633dc4
Author: justinjleet <[email protected]>
Date:   2017-01-19T14:13:47Z

    updating stellar function and readme

commit 46e50bb7b5789811fa0e3545dbd4c3c7b480d079
Author: justinjleet <[email protected]>
Date:   2017-01-19T14:18:42Z

    Adding a couple unit tests and cleaning up Stellar function results

commit fc856f6b88bdb7191dd0f8fa7d7fd136433ea22e
Author: justinjleet <[email protected]>
Date:   2017-01-19T14:23:02Z

    Updating Stellar docs

commit 6df268d0e5313435b6c320b7cbd1ec180bf92c92
Author: justinjleet <[email protected]>
Date:   2017-01-19T20:24:36Z

    Adding note to readme about script interaction with Ambari

----


> Migrate Geo Enrichment outside of MySQL
> ---------------------------------------
>
>                 Key: METRON-283
>                 URL: https://issues.apache.org/jira/browse/METRON-283
>             Project: Metron
>          Issue Type: Improvement
>            Reporter: James Sirota
>            Assignee: Justin Leet
>            Priority: Minor
>
> We need to migrate our enrichment SQL store from MySQL to Phoenix or some 
> other SQL-on-HBase library.  Alternatively, come up with a way to do this 
> without using SQL.  This way we don't have a dependency on MySQL and there is 
> one less thing that we need to install on our platform.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
