Hi all,

As a bit of background: right now, GeoIP data is loaded into and managed by MySQL. The MySQL connectors are LGPL licensed, and we need to sever our Maven dependency on them before the next release. We currently depend on and install an instance of MySQL in each of the Management Pack, Ansible, and Docker installs. In the topology, we use the JDBCAdapter to connect to MySQL and query for a given IP. MySQL is also a single point of failure for that particular enrichment right now: if it is down, geo enrichment can't occur.

I'm proposing that we eliminate the use of MySQL entirely, through all installation paths (which, unless I missed some, are Ansible, the Ambari Management Pack, and Docker). We'd do this by dropping all of the MySQL setup and management in the code, along with all the DDL, etc. The JDBCAdapter would stay, so that anybody who wants to set up their own databases for enrichments and install connectors is still able to do so.

In place of MySQL, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file. (This is NOT a separate installation of anything; it's just a jar that manages interaction with the file system.) Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that is run when needed. It downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file; the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make the change transparent to downstream code.

The particularly nice parts about MapDB are its ease of use and that it offers, out of the box, the utilities we need to support our operations on this data. (Keep in mind that the GeoIP files use IP ranges, and we need to be able to easily grab the appropriate range for a given IP.)

The main concern I have is that when we grab the HDFS file during an update, multiple JVMs can be running, and we don't want them to clobber each other. I believe this can be avoided by simply storing the file in each worker's working directory (and making sure that threads within the same JVM coordinate their access to it). This should keep the JVMs (and the underlying DB files) entirely independent.

The download-and-build script would get called by the various installations during startup to do the initial setup. After install, it can then be called on demand.
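
To make the range lookup mentioned above a bit more concrete, here is a rough sketch of how "grab the appropriate range" could work on top of a sorted map. This is only an illustration under assumptions, not the proposed implementation: the class name GeoRangeLookup, the Range holder, and the sample addresses and labels are all made up, it handles IPv4 only, and a plain java.util.TreeMap stands in for the file-backed map. As far as I know, MapDB's BTreeMap exposes the same NavigableMap-style floorEntry call, but treat that as something to verify.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.Optional;
import java.util.TreeMap;

public class GeoRangeLookup {

    /** One GeoIP range: inclusive end address plus a location label (hypothetical shape). */
    private static final class Range {
        final long end;
        final String location;

        Range(long end, String location) {
            this.end = end;
            this.location = location;
        }
    }

    // Keyed by each range's start address. A TreeMap stands in for the file-backed map here.
    private final NavigableMap<Long, Range> ranges = new TreeMap<>();

    /** Converts a dotted-quad IPv4 string into an unsigned 32-bit value held in a long. */
    public static long ipToLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Registers an inclusive [startIp, endIp] range with its location label. */
    public void addRange(String startIp, String endIp, String location) {
        ranges.put(ipToLong(startIp), new Range(ipToLong(endIp), location));
    }

    /** Finds the range with the greatest start <= ip, then checks that ip falls before its end. */
    public Optional<String> lookup(String ip) {
        long addr = ipToLong(ip);
        Map.Entry<Long, Range> candidate = ranges.floorEntry(addr);
        if (candidate == null || addr > candidate.getValue().end) {
            return Optional.empty(); // ip sits in a gap between ranges, or below all of them
        }
        return Optional.of(candidate.getValue().location);
    }

    public static void main(String[] args) {
        GeoRangeLookup db = new GeoRangeLookup();
        db.addRange("10.0.0.0", "10.0.0.255", "site-a");
        db.addRange("10.0.2.0", "10.0.2.255", "site-b");

        System.out.println(db.lookup("10.0.0.42")); // prints Optional[site-a]
        System.out.println(db.lookup("10.0.1.7"));  // prints Optional.empty
    }
}
```

Keying on the start address means a single floorEntry call finds the only range that could contain the IP, and the end-address check handles gaps between ranges. This assumes the ranges do not overlap.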

At this point, we should be all set, with everything running and updatable.

Justin