[
https://issues.apache.org/jira/browse/METRON-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15831741#comment-15831741
]
ASF GitHub Bot commented on METRON-283:
---------------------------------------
GitHub user justinleet opened a pull request:
https://github.com/apache/incubator-metron/pull/421
METRON-283 Migrate Geo Enrichment outside of MySQL
## MySQL Removed
Drops MySQL entirely from the project. This is done for a few reasons
outlined in a discussion thread on the dev list. They boil down to
licensing, eliminating a single point of failure, using MaxMind's libraries
to handle GeoLite2 data, and a couple of other concerns.
The removal covers dependencies, installation paths, READMEs, etc. The only
places left are MySQLConfig and MySQLConfigTest, in case anyone wants to use
them. The vast majority of removed files and code are simply from stripping
this out. If any other traces of MySQL are found in review, they should
almost certainly be removed.
This moves to a system based on using [MaxMind's binary
DB](http://dev.maxmind.com/geoip/geoip2/geolite2/) directly. Per their page:
> The GeoLite2 databases are distributed under the Creative Commons
Attribution-ShareAlike 4.0 International License.
Our LICENSE file has been updated with a notice that we use their database
(and some of their [test data](https://github.com/maxmind/MaxMind-DB), which
is CC Attribution-ShareAlike 3.0).
Both of these licenses are acceptable for us as stated on [Apache
Legal](https://www.apache.org/legal/resolved#cc-sa).
Let me know if that notification should be in a different spot or spots,
and I can adjust it appropriately.
## GeoLite2 Database
The main portion of the PR is in `GeoLiteDatabase.java`, which manages
access to the GeoLite2 database.
### Raw Database
The raw database is stored on HDFS, by default under `/apps/metron/geo`.
If no explicit location is given, `/apps/metron/geo/default/<db_filename>`
will be used. Otherwise, updates will use
`/apps/metron/geo/<millis>/<db_filename>`. Given the low rate of churn on
the DB (updated once per week) and the potential for replay use cases, I
haven't implemented any pruning or anything fancy on top of this.
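As a rough illustration of the layout described above (a Python sketch, not the PR's code), the path could be derived as follows. The function name `geo_db_path` and the example filename are hypothetical; only the base directory and the `default` / `<millis>` convention come from this description.

```python
import time
from typing import Optional

# Base directory named in the PR description.
GEO_BASE_DIR = "/apps/metron/geo"


def now_millis() -> int:
    """Current time as epoch milliseconds, used to name update directories."""
    return int(time.time() * 1000)


def geo_db_path(db_filename: str, millis: Optional[int] = None,
                base_dir: str = GEO_BASE_DIR) -> str:
    """Return the HDFS path for a GeoLite2 DB file (hypothetical helper).

    Without a timestamp the 'default' directory is used; an update gets
    its own directory named after the epoch milliseconds.
    """
    subdir = "default" if millis is None else str(millis)
    return f"{base_dir}/{subdir}/{db_filename}"


# The filename below is only an example value for <db_filename>.
print(geo_db_path("GeoLite2-City.mmdb.gz"))
print(geo_db_path("GeoLite2-City.mmdb.gz", millis=now_millis()))
```

Because every update lands in a new timestamped directory, older files stay available for replay until someone removes them by hand.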
#### Updating DB
A script for updates, `geo_enrichment_load.sh`, is provided in
`metron-data-management`. Usage details are provided in
`metron-data-management/README.md`. Note that the original didn't appear to
have update capabilities.
The script pulls down a new copy of the GeoLite2 database. The source can
be MaxMind's standard web address, another hosted location, or even a
file:// URL. Once the DB file is pulled down, the script pushes it to the
appropriate HDFS location. Finally, it pulls down the global config and
updates it with the new location. This does not require a topology restart.
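The three steps above (fetch, push, update config) can be sketched in Python as follows. This is not `geo_enrichment_load.sh` itself: a local directory stands in for HDFS, a JSON file stands in for the global config in ZooKeeper, and the function name `load_geo_db` and the config key `geo.hdfs.file` are assumptions made for illustration.

```python
import json
import shutil
import time
import urllib.request
from pathlib import Path


def load_geo_db(source_url: str, store_root: Path, config_path: Path,
                config_key: str = "geo.hdfs.file") -> Path:
    """Fetch a DB file, stage it in a timestamped directory, update the config.

    store_root stands in for the HDFS base directory and config_path for the
    global config; both are local files here so the sketch can run anywhere.
    """
    millis = int(time.time() * 1000)
    filename = source_url.rsplit("/", 1)[-1]
    target = store_root / str(millis) / filename
    target.parent.mkdir(parents=True, exist_ok=True)

    # Steps 1 and 2: pull the file (urlopen handles http://, https:// and
    # file:// URLs) and write it to the store.
    with urllib.request.urlopen(source_url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)

    # Step 3: read the current config, point it at the new file, write it back.
    # Other keys in the config are left untouched.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config[config_key] = str(target)
    config_path.write_text(json.dumps(config, indent=2))
    return target
```

Since only the config value changes, anything watching the config can pick up the new path without being restarted, which is the behavior the description relies on.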
Note that there have been conversations about how we manage config updates
(specifically leaning towards Ambari). This has not been finalized, and we
have two non-Ambari testing environments (quickdev and docker-metron), so
this just hits ZK. Ambari is not updated based on this script, and it is the
user's responsibility to update global.json.
This leads to a question people may have preferences on:
- Do we want the script to always update? Should there be a flag to stage
the file, but not update configs?
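To make the question concrete, here is a small Python sketch of what a staging flag could look like. The `--geo-url` and `--stage-only` flags and the step names are hypothetical and do not describe the current script's options.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Command-line options for a hypothetical loader with a staging mode."""
    parser = argparse.ArgumentParser(description="Load a GeoLite2 DB file")
    parser.add_argument("--geo-url", required=True,
                        help="where to fetch the DB file from")
    parser.add_argument("--stage-only", action="store_true",
                        help="push the file but leave the global config untouched")
    return parser


def plan_steps(args: argparse.Namespace) -> List[str]:
    """List the steps that would run for the given options."""
    steps = ["download", "push_to_hdfs"]
    if not args.stage_only:
        steps.append("update_global_config")
    return steps


print(plan_steps(build_parser().parse_args(["--geo-url", "file:///tmp/db", "--stage-only"])))
```

With a flag like this, an Ambari-managed install could stage the file and leave the config change to Ambari.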
### Code
The database class is a singleton that allows the database to be updated
when the global config is updated. It is (hopefully!) correctly locked to
avoid threading issues when updating or reading from the DB (and I've been
able to update without issues).
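The general pattern (a singleton that reloads when the configured path changes and guards reads and reloads with a lock) can be sketched in Python as below. This is not the PR's Java implementation: the class name, the `loader` callback, and the `geo.hdfs.file` key are assumptions, and a single mutex is used for simplicity where a reader-writer lock would let reads proceed concurrently.

```python
import threading
from typing import Callable, Dict, Optional


class GeoDatabase:
    """Process-wide holder for the currently loaded geo data (illustrative)."""

    _instance: Optional["GeoDatabase"] = None
    _instance_lock = threading.Lock()

    def __init__(self, loader: Callable[[str], Dict[str, dict]]):
        self._loader = loader          # hypothetical: maps a file path to {ip: record}
        self._lock = threading.Lock()  # guards _path and _records
        self._path: Optional[str] = None
        self._records: Dict[str, dict] = {}

    @classmethod
    def instance(cls, loader: Callable[[str], Dict[str, dict]]) -> "GeoDatabase":
        """Return the single shared instance, creating it on first use."""
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls(loader)
            return cls._instance

    def update(self, global_config: dict, key: str = "geo.hdfs.file") -> bool:
        """Reload only when the configured path changed; return True if reloaded."""
        new_path = global_config.get(key)
        with self._lock:
            if new_path is None or new_path == self._path:
                return False
            # Readers wait while the new file loads, so they never see a
            # half-updated database.
            self._records = self._loader(new_path)
            self._path = new_path
            return True

    def get(self, ip: str) -> Optional[dict]:
        """Look up an IP in the currently loaded data."""
        with self._lock:
            return self._records.get(ip)
```

Calling `update` on every config change is cheap because an unchanged path returns early without touching the loaded data.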
The various Bolts have been updated to make sure they initialize the
adapter to have it grab the current data appropriately.
In addition, a Stellar function, `GEO_GET()`, has been provided. It takes
an IPv4 address. It probably works with an IPv6 address, but I didn't really
dig into it, given that the initial goal was to reach parity.
Given the somewhat core nature of this, and my relative unfamiliarity going
in with how all these pieces tie together, I'm definitely looking for feedback
on how things are implemented, or if I missed conventions we've used in the
code.
## Testing
Unit testing is added for the database and Stellar portions of the code as
needed. The DB testing uses one of the test DBs that MaxMind has published,
because we can't create the binary format correctly. It does not use the full
(20+ MB) version of the data, but rather a stripped-down version (on the order
of several KB).
Three environments were tested during this. Having these three disparate
environments makes cross-cutting features like this more complicated to test,
so additional scrutiny would be merited (I would definitely like at least one
person to run through one of these themselves and make sure it's transparent).
Notably, quickdev requires Ansible setup scripts to align; the mpack requires
layout, internal configuration, and handling of additional files and ownership
of scripts to work properly; and docker-metron requires essentially cheating
the scripts and just running a wget on the file because things aren't actually
set up.
- quickdev
- Ambari Management Pack
- docker-metron
### QuickDev
Ansible scripts are updated. Running data through the topologies produced
the enriched data as output.
### Ambari Management Pack
RPMs updated where needed. Config Screen layout changed, updates made to
properly handle configs and ownership. Ran Stellar on this install. Again,
ran data through the topology.
### Docker
Essentially this just involved cheating the scripts and running a wget on
the GeoLite2 DB file, because there's no Hadoop. Ran through the instructions
to run the topologies (which are a little different than the others because
of Docker) and again was able to get data back out.
## Additional Notes
- As noted above, do we want the DB script to always update? Should there
be a flag to stage the file, but not update configs? I primarily see this
affecting the mpack because of the Ambari management behind it.
- Increased `withMaxTimeMS` in the indexing integration test. This seems
unrelated to my changes (and I believe had been seen elsewhere), so if anybody
has found the root cause, I can adjust my code appropriately.
- LocID doesn't technically exist in the new data, and I suspect it was
never meant to be relied upon anyway outside of being a join key. The same
applies to the new field that is replacing it in this context. It seems like
we were mostly just passing that field along because it was available, and it
seems like it should be refactored to be more useful. I didn't take on that
analysis here; it's the slightly more validated version of a gut feeling.
- The newer form of the MaxMind info has more data available than the old
source we were using. We should also consider passing at least some of this
data along. See MaxMind's
[What's New in GeoIP2](http://dev.maxmind.com/geoip/geoip2/whats-new-in-geoip2/).
One that leapt out at me as potentially interesting was a field containing
where an IP was registered, rather than just where the IP actually is. Another
is keys such as `is_anonymous_proxy`. I didn't validate whether everything
new is in the free version of the dataset.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/justinleet/incubator-metron geo_mmdb
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-metron/pull/421.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #421
----
commit fe5e6b87e18ed08f6136bb59feb525ba50b978cd
Author: justinjleet <[email protected]>
Date: 2016-12-22T17:25:32Z
Drop MySQL and use the GeoLite2 databases instead
commit b907d26f04fa5d8f4af50ecd1b8850aab6241e77
Author: justinjleet <[email protected]>
Date: 2017-01-18T13:04:37Z
Updating unit test
commit 1b434ba8942e56f43ac05027c9bdd920ce135672
Author: justinjleet <[email protected]>
Date: 2017-01-19T04:00:38Z
Update with MaxMind license
commit dd14dbf397458f26b62ec6638c0a39d06756b15c
Author: justinjleet <[email protected]>
Date: 2017-01-19T06:57:41Z
geo url fix
commit 09b33b75a3690269aa88609a96f1ef7d7ea6112f
Author: justinjleet <[email protected]>
Date: 2017-01-19T07:07:17Z
fixing docker
commit c9cbb23ebaef0a03811d789af52448662db95cb0
Author: justinjleet <[email protected]>
Date: 2017-01-19T07:08:39Z
fixing Ansible after adjusting default path
commit 859106a7fffbbb47c86ea2d55b2a3600cc4b595d
Author: justinjleet <[email protected]>
Date: 2017-01-19T13:58:45Z
Fixing metron-docker
commit b6bfc16cf2776272491914f0da55ea19a74c4006
Author: justinjleet <[email protected]>
Date: 2017-01-19T14:05:47Z
Update docs
commit 6760c3e95cbefd97098b39ac385edfbf36633dc4
Author: justinjleet <[email protected]>
Date: 2017-01-19T14:13:47Z
updating stellar function and readme
commit 46e50bb7b5789811fa0e3545dbd4c3c7b480d079
Author: justinjleet <[email protected]>
Date: 2017-01-19T14:18:42Z
Adding a couple unit tests and cleaning up Stellar function results
commit fc856f6b88bdb7191dd0f8fa7d7fd136433ea22e
Author: justinjleet <[email protected]>
Date: 2017-01-19T14:23:02Z
Updating Stellar docs
commit 6df268d0e5313435b6c320b7cbd1ec180bf92c92
Author: justinjleet <[email protected]>
Date: 2017-01-19T20:24:36Z
Adding note to readme about script interaction with Ambari
----
> Migrate Geo Enrichment outside of MySQL
> ---------------------------------------
>
> Key: METRON-283
> URL: https://issues.apache.org/jira/browse/METRON-283
> Project: Metron
> Issue Type: Improvement
> Reporter: James Sirota
> Assignee: Justin Leet
> Priority: Minor
>
> We need to migrate our enrichment SQL store from MySQL to Phoenix or some
> other SQL on Hbase library. Or alternatively come up with a way to do this
> without using SQL. This way we don't have a dependency on MySQL and there is
> one less thing that we need to install on our platform