I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an
HBase enrichment. If our current caching isn't enough to mitigate the above
issues, we have a problem, don't we? Or do we not recommend HBase
enrichment for per-message enrichment in general?

Also - can you elaborate on how MapDB would not require a network hop?
Doesn't this mean we would have to sync the enrichment data to each Storm
supervisor? HDFS could (probably would) have a network hop too, no?

Fwiw -
"In its place, I've looked at using MapDB, which is a really easy to use
library for creating Java collections backed by a file (This is NOT a
separate installation of anything, it's just a jar that manages interaction
with the file system).  Given the slow churn of the GeoIP files (I believe
they get updated once a week), we can have a script that can be run when
needed, downloads the MaxMind tar file, builds the MapDB file that will be
used by the bolts, and places it into HDFS.  Finally, we update a config to
point to the new file, the bolts get the updated config callback and can
update their db files.  Inside the code, we wrap the MapDB portions to make
it transparent to downstream code."

Seems a bit more complex than "refresh the hbase table". Afaik, either
approach would require some sort of translation between GeoIP source format
and target format, so that part is a wash imo.
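
For what it's worth, the lookup both approaches need - find the range that
contains a given IP - boils down to a floor lookup on a sorted map. Here's a
minimal sketch using java.util.TreeMap as a stand-in (MapDB's BTreeMap
exposes the same NavigableMap-style lookups; all class and field names here
are made up for illustration, not actual Metron code):

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class GeoRangeLookup {
    // Convert a dotted-quad IPv4 address to an unsigned 32-bit value in a long.
    static long ipToLong(String ip) {
        long value = 0;
        for (String octet : ip.split("\\.")) {
            value = (value << 8) | Integer.parseInt(octet);
        }
        return value;
    }

    public static void main(String[] args) {
        // Sorted map keyed by the start of each range; the value carries the
        // range end plus the geo record (here just a country code).
        NavigableMap<Long, String[]> ranges = new TreeMap<>();
        ranges.put(ipToLong("10.0.0.0"),    new String[]{"10.0.0.255",    "US"});
        ranges.put(ipToLong("192.168.1.0"), new String[]{"192.168.1.255", "DE"});

        long query = ipToLong("192.168.1.42");
        // floorEntry finds the greatest range start <= query; then check the
        // query actually falls inside that range.
        Map.Entry<Long, String[]> candidate = ranges.floorEntry(query);
        if (candidate != null && query <= ipToLong(candidate.getValue()[0])) {
            System.out.println(candidate.getValue()[1]); // prints "DE"
        }
    }
}
```

Either backing store can support this pattern; the difference is where the
sorted structure lives, not the lookup itself.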

So, I'd really like to see, at least, an attempt to leverage HBase
enrichment.

-D...


On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <ceste...@gmail.com> wrote:

> I think that it's a sensible thing to use MapDB for the geo enrichment.
> Let me state my reasoning:
>
>    - An HBase implementation would necessitate an HBase scan, possibly
>    hitting HDFS, which is expensive per-message.
>    - An HBase implementation would necessitate a network hop and MapDB
>    would not.
>
> I also think this might be the beginning of more general-purpose support
> in Stellar for locally shipped, read-only MapDB lookups, which might be
> interesting.
>
> In short, all quotes about premature optimization are sure to apply to my
> reasoning, but I can't help but have my spidey senses tingle when we
> introduce a scan-per-message architecture.
>
> Casey
>
> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <dima.koval...@sstech.us>
> wrote:
>
> > Hello Justin,
> >
> > Considering that Metron uses HBase tables for storing enrichment and
> > threatintel feeds, can we use HBase for geo enrichment as well?
> > Or can MapDB be used for enrichment and threatintel feeds instead of
> > HBase?
> >
> > - Dima
> >
> > On 01/16/2017 04:17 PM, Justin Leet wrote:
> > > Hi all,
> > >
> > > As a bit of background, right now, GeoIP data is loaded into and
> > > managed by MySQL (the connectors are LGPL licensed and we need to
> > > sever our Maven dependency on it before next release). We currently
> > > depend on and install an instance of MySQL (in each of the Management
> > > Pack, Ansible, and Docker installs). In the topology, we use the
> > > JDBCAdapter to connect to MySQL and query for a given IP.
> > > Additionally, it's a single point of failure for that particular
> > > enrichment right now.  If MySQL is down, geo enrichment can't occur.
> > >
> > > I'm proposing that we eliminate the use of MySQL entirely, through all
> > > installation paths (which, unless I missed some, includes Ansible, the
> > > Ambari Management Pack, and Docker).  We'd do this by dropping all the
> > > various MySQL setup and management through the code, along with all
> > > the DDL, etc.  The JDBCAdapter would stay, so that anybody who wants
> > > to set up their own databases for enrichments and install connectors
> > > is able to do so.
> > >
> > > In its place, I've looked at using MapDB, which is a really easy to
> > > use library for creating Java collections backed by a file (This is
> > > NOT a separate installation of anything, it's just a jar that manages
> > > interaction with the file system).  Given the slow churn of the GeoIP
> > > files (I believe they get updated once a week), we can have a script
> > > that can be run when needed, downloads the MaxMind tar file, builds
> > > the MapDB file that will be used by the bolts, and places it into
> > > HDFS.  Finally, we update a config to point to the new file, the bolts
> > > get the updated config callback and can update their db files.  Inside
> > > the code, we wrap the MapDB portions to make it transparent to
> > > downstream code.
> > >
> > > The particularly nice parts about using MapDB are its ease of use,
> > > plus it offers the utilities we need out of the box to be able to
> > > support the operations we need on this (Keep in mind the GeoIP files
> > > use IP ranges and we need to be able to easily grab the appropriate
> > > range).
> > >
> > > The main point of concern I have about this is that when we grab the
> > > HDFS file during an update, given that multiple JVMs can be running,
> > > we don't want them to clobber each other.  I believe this can be
> > > avoided by simply using each worker's working directory to store the
> > > file (and appropriately ensuring threads on the same JVM manage
> > > multithreading).  This should keep the JVMs (and the underlying DB
> > > files) entirely independent.
> > >
> > > This script would get called by the various installations during
> > > startup to do the initial setup.  After install, it can then be called
> > > on demand.
> > >
> > > At this point, we should be all set, with everything running and
> > > updatable.
> > >
> > > Justin
> > >
> >
> >
>
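
On the multi-JVM concern in Justin's note above: since each worker keeps its
own copy of the DB file, the remaining coordination is between threads
within one JVM, and an atomic swap on the config callback covers that. A
hedged sketch of the idea (class and method names are invented for
illustration, not actual Metron code):

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

// Readers always see a consistent snapshot of the geo map; the config-update
// callback swaps in a freshly built map atomically after the new file has
// been pulled from HDFS into this worker's working directory and loaded.
public class GeoDbHolder {
    private static final AtomicReference<NavigableMap<Long, String>> CURRENT =
            new AtomicReference<>(new TreeMap<>());

    // Called from enrichment bolts on every message; lock-free read.
    public static NavigableMap<Long, String> get() {
        return CURRENT.get();
    }

    // Called from the config-update callback with the newly loaded map.
    public static void update(NavigableMap<Long, String> fresh) {
        CURRENT.set(fresh);
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> fresh = new TreeMap<>();
        fresh.put(0L, "initial-record");
        update(fresh);
        System.out.println(get().get(0L)); // prints "initial-record"
    }
}
```

In-flight lookups holding the old reference finish against the old map,
which is what makes the swap safe without locking the read path.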
