Hi all,

As a bit of background: right now, GeoIP data is loaded into and managed by
MySQL (the connectors are LGPL licensed, and we need to sever our Maven
dependency on them before the next release). We currently depend on and
install an instance of MySQL in each of the Management Pack, Ansible, and
Docker installs. In the topology, we use the JDBCAdapter to connect to MySQL
and query for a given IP. MySQL is also a single point of failure for geo
enrichment right now: if MySQL is down, geo enrichment can't occur.

I'm proposing that we eliminate the use of MySQL entirely, across all
installation paths (which, unless I missed some, are Ansible, the Ambari
Management Pack, and Docker). We'd do this by dropping all the MySQL setup
and management throughout the code, along with all the DDL, etc. The
JDBCAdapter would stay, so that anybody who wants to set up their own
databases for enrichments and install connectors is able to do so.

In its place, I've looked at using MapDB, which is an easy-to-use library
for creating Java collections backed by a file. (This is NOT a separate
installation of anything; it's just a jar that manages interaction with the
file system.) Given the slow churn of the GeoIP files (I believe they get
updated once a week), we can have a script that is run when needed. It
downloads the MaxMind tar file, builds the MapDB file that will be used by
the bolts, and places it into HDFS. Finally, we update a config to point to
the new file; the bolts get the updated config callback and can update their
db files. Inside the code, we wrap the MapDB portions so that the change is
transparent to downstream code.

The particularly nice parts about using MapDB are its ease of use and that
it offers, out of the box, the utilities we need to support the operations
we need on this data. (Keep in mind the GeoIP files use IP ranges, and we
need to be able to easily grab the appropriate range.)
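
To make the range lookup concrete, here is a rough sketch of the idea
rather than the proposed implementation. It uses java.util.TreeMap as a
stand-in for a file-backed sorted map (MapDB's sorted map should, as far as
I know, offer the same NavigableMap-style operations). The class name
GeoRangeIndex, the Block type, and the methods addBlock, lookup, and toLong
are all hypothetical names made up for this illustration.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical index keyed by the first address of each IP block. */
class GeoRangeIndex {
    /** Last address of a block plus an illustrative location id. */
    static final class Block {
        final long endIp;
        final String locationId;

        Block(long endIp, String locationId) {
            this.endIp = endIp;
            this.locationId = locationId;
        }
    }

    // Assumption: a TreeMap stands in for the file-backed sorted map.
    private final NavigableMap<Long, Block> blocksByStart = new TreeMap<>();

    void addBlock(String startIp, String endIp, String locationId) {
        blocksByStart.put(toLong(startIp), new Block(toLong(endIp), locationId));
    }

    /** Finds the block whose [start, end] range contains the address, if any. */
    Optional<String> lookup(String ip) {
        long value = toLong(ip);
        // floorEntry returns the block with the greatest start <= value.
        Map.Entry<Long, Block> candidate = blocksByStart.floorEntry(value);
        if (candidate == null || value > candidate.getValue().endIp) {
            return Optional.empty(); // address falls in a gap between blocks
        }
        return Optional.of(candidate.getValue().locationId);
    }

    /** Converts a dotted-quad IPv4 address to its numeric value. */
    static long toLong(String dottedQuad) {
        String[] octets = dottedQuad.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + dottedQuad);
        }
        long result = 0;
        for (String octet : octets) {
            int part = Integer.parseInt(octet);
            if (part < 0 || part > 255) {
                throw new IllegalArgumentException("Octet out of range: " + dottedQuad);
            }
            result = (result << 8) | part;
        }
        return result;
    }
}
```

The point of the sketch is that a single floorEntry call plus an end-of-range
check is enough to resolve an address to its block, which is the kind of
operation a sorted collection gives us for free.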

The main point of concern I have about this is that when we grab the HDFS
file during an update, multiple JVMs can be running, and we don't want them
to clobber each other. I believe this can be avoided by simply using each
worker's working directory to store the file (and ensuring that threads in
the same JVM coordinate their access appropriately). This should keep the
JVMs (and the underlying DB files) entirely independent.
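
As a rough illustration of the per-worker copy, here is a minimal sketch
under a few assumptions. The class LocalGeoDbHolder and its methods are
hypothetical names, and a plain java.nio Path stands in for the HDFS source;
real code would presumably fetch the file through the Hadoop FileSystem API
instead. The synchronized method is one simple way to keep threads in the
same JVM from copying at the same time.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical holder for one worker's private copy of the GeoIP db file. */
class LocalGeoDbHolder {
    private final Path workerDir;
    // Readers on any thread see either the old file or the new one.
    private final AtomicReference<Path> current = new AtomicReference<>();

    LocalGeoDbHolder(Path workerDir) {
        this.workerDir = workerDir;
    }

    /**
     * Intended to be called from the config callback. Synchronized so that
     * only one thread in this JVM performs the copy at a time.
     * Assumption: 'source' is a local path standing in for the HDFS file.
     */
    synchronized Path update(Path source) throws IOException {
        Files.createDirectories(workerDir);
        // Copy to a temp file first so a partially written file is never published.
        Path tmp = Files.createTempFile(workerDir, "geo-", ".tmp");
        Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
        Path target = workerDir.resolve(source.getFileName().toString());
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        // Publish the new file; cleanup of the previous file is left out here,
        // since readers may still have it open.
        current.set(target);
        return target;
    }

    /** Returns the currently published local file, or null before the first update. */
    Path current() {
        return current.get();
    }
}
```

Because each worker passes its own working directory to the constructor, two
JVMs on the same host never write to the same local file.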

The update script would get called by the various installations during
startup to do the initial setup. After install, it can then be called on
demand in order to refresh the GeoIP data.

At this point, we should be all set, with everything running and updatable.

Justin
