Re: extensibility - I am one of those enterprise users who plan to do enrichment using their IPAM data in the next couple of months. However, since the information that I have is in a much different format compared to MaxMind's, my approach was going to be to make a completely separate HBase enricher. That also makes it easier for me to upgrade my Metron cluster in the future, as I would not be customizing a built-in.
That said, I'm game for a follow-on enhancement, but for now this should probably just be a replacement of what currently exists.

Jon

On Mon, Jan 16, 2017 at 12:15 PM Justin Leet <[email protected]> wrote:
> I definitely agree on checking out the MaxMind API. I'll take a look at it, but at first glance it looks like it does include everything we use. Great find, JJ.
>
> More details on various people's points:
>
> As a note to anyone hopping in, Simon's point on the range lookup vs a key lookup is why it becomes a Scan in HBase vs a Get. As an addendum to what Simon mentioned, denormalizing is easy enough and turns it into an easy range lookup.
>
> To David's point, the MapDB approach does require a network hop, but it's once per refresh of the data (Got a relevant callback? Grab new data, load it, swap out) instead of (up to) once per message. I would expect the same to be true of the MaxMind db files.
>
> I'd also argue MapDB is not really more complex than refreshing the HBase table, because we potentially have to start worrying about things like hashing and/or indices and even just general data representation. It's definitely correct that the file processing has to occur on either path, so it really boils down to handling the callback and reloading the file vs handling some of the standard HBasey things. I don't think either is an enormous amount of work (and both are almost certainly more work than MaxMind's API).
>
> Regarding extensibility, I'd argue for parity with what we have first, then build what we need from there. Does anybody have any disagreement with that approach for right now?
>
> Justin
>
> On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <[email protected]> wrote:
> > It is interesting- it would save us a ton of effort, and has the right license. I think it's worth at least checking out.
> >
> > -D...
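Justin's note that denormalizing turns the MaxMind data into an easy range lookup can be made concrete. The toy class below is illustrative only: the names and ranges are made up, and a plain java.util.TreeMap stands in for a MapDB BTree (or an HBase row-key layout). Once ranges are denormalized into non-overlapping [start, end] intervals keyed by their start address, each lookup becomes a single floor search instead of a scan:

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Toy sketch of the "denormalize, then do a single floor lookup" pattern
 * for range-keyed GeoIP-style data. Not Metron code; all names are made up.
 */
class GeoRangeSketch {
    static final class Range {
        final long end;
        final String label;
        Range(long end, String label) { this.end = end; this.label = label; }
    }

    // Sorted map keyed by the start of each non-overlapping IP range.
    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    public void put(long start, long end, String label) {
        ranges.put(start, new Range(end, label));
    }

    /** One floorEntry call replaces a scan: find the last range starting <= ip. */
    public String lookup(long ip) {
        Map.Entry<Long, Range> e = ranges.floorEntry(ip);
        if (e == null || ip > e.getValue().end) {
            return null; // ip falls in a gap between ranges
        }
        return e.getValue().label;
    }

    /** Pack a dotted-quad IPv4 address into a long for ordered comparison. */
    public static long toLong(String ip) {
        long v = 0;
        for (String octet : ip.split("\\.")) {
            v = (v << 8) | Integer.parseInt(octet);
        }
        return v;
    }
}
```

The same shape works whether the sorted structure lives in MapDB, in memory, or as HBase row keys; the point is that denormalization turns "which range contains this IP?" into a single ordered lookup.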
> > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <[email protected]> wrote:
> > > I like that approach even more. That way we would only have to worry about distributing the database file in binary format to all the supervisor nodes on update.
> > >
> > > It would also make it easier for people to switch to the enterprise DB potentially if they had the license.
> > >
> > > One slight issue with this might be for people who wanted to extend the database. For example, organisations may want to add geo-enrichment to their own private network addresses based on modified versions of the geo database. Currently we don’t really allow this, since we hard-code ignoring private network classes into the geo enrichment adapter, but I can see a case where a global org might want to add their own ranges and locations to the data set. Does that make sense to anyone else?
> > >
> > > Simon
> > >
> > > On 16 Jan 2017, at 16:50, JJ Meyer <[email protected]> wrote:
> > > > Hello all,
> > > >
> > > > Can we leverage maxmind's Java client (https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2) in this case? I believe it can directly read the MaxMind files. Plus I think it also has some support for caching.
> > > >
> > > > Thanks,
> > > > JJ
> > > >
> > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <[email protected]> wrote:
> > > > > I like the idea of MapDB, since we can essentially pull an instance into each supervisor, so it makes a lot of sense for relatively small scale, relatively static enrichments in general.
> > > > >
> > > > > Generally this feels like a caching problem, and would be for a simple key-value lookup. In that case I would agree with David Lyle on using HBase as a source of truth and relying on caching.
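For reference, JJ's suggestion would look roughly like the following. This sketch is based on the GeoIP2-java reader API (DatabaseReader plus the CHMCache node cache JJ alludes to); it assumes the geoip2 dependency is on the classpath and a GeoLite2 City database file has already been downloaded, so treat the file name and field accesses as illustrative, not tested:

```java
import java.io.File;
import java.net.InetAddress;

import com.maxmind.db.CHMCache;
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.model.CityResponse;

class MaxMindSketch {
    public static void main(String[] args) throws Exception {
        // Assumes the GeoLite2 City .mmdb file was downloaded separately.
        File db = new File("GeoLite2-City.mmdb");

        // withCache enables the in-memory node cache for repeated lookups.
        DatabaseReader reader = new DatabaseReader.Builder(db)
                .withCache(new CHMCache())
                .build();

        CityResponse response = reader.city(InetAddress.getByName("128.101.101.101"));
        System.out.println(response.getCountry().getIsoCode());
        System.out.println(response.getCity().getName());
        System.out.println(response.getLocation().getLatitude());
    }
}
```

If this covers the fields the current geo enrichment emits, it would remove both the MySQL dependency and the need to build our own MapDB file format.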
> > > > > That said, GeoIP is a different lookup pattern, since it’s a range lookup then a key lookup (or if we denormalize the MaxMind data, just a range lookup). For that kind of thing, MapDB with something like the BTree seems a good fit.
> > > > >
> > > > > Simon
> > > > >
> > > > > On 16 Jan 2017, at 16:28, David Lyle <[email protected]> wrote:
> > > > > > I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an HBase enrichment. If our current caching isn't enough to mitigate the above issues, we have a problem, don't we? Or do we not recommend HBase enrichment for per-message enrichment in general?
> > > > > >
> > > > > > Also- can you elaborate on how MapDB would not require a network hop? Doesn't this mean we would have to sync the enrichment data to each Storm supervisor? HDFS could (probably would) have a network hop too, no?
> > > > > >
> > > > > > Fwiw -
> > > > > > "In its place, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file (This is NOT a separate installation of anything, it's just a jar that manages interaction with the file system). Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that can be run when needed, downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file, the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make it transparent to downstream code."
> > > > > >
> > > > > > Seems a bit more complex than "refresh the hbase table".
> > > > > > Afaik, either approach would require some sort of translation between GeoIP source format and target format, so that part is a wash imo.
> > > > > >
> > > > > > So, I'd really like to see, at least, an attempt to leverage HBase enrichment.
> > > > > >
> > > > > > -D...
> > > > > >
> > > > > > On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <[email protected]> wrote:
> > > > > > > I think that it's a sensible thing to use MapDB for the geo enrichment. Let me state my reasoning:
> > > > > > >
> > > > > > > - An HBase implementation would necessitate an HBase scan possibly hitting HDFS, which is expensive per-message.
> > > > > > > - An HBase implementation would necessitate a network hop and MapDB would not.
> > > > > > >
> > > > > > > I also think this might be the beginning of more general purpose support in Stellar for locally shipped, read-only MapDB lookups, which might be interesting.
> > > > > > >
> > > > > > > In short, all quotes about premature optimization are sure to apply to my reasoning, but I can't help but have my spidey senses tingle when we introduce a scan-per-message architecture.
> > > > > > >
> > > > > > > Casey
> > > > > > >
> > > > > > > On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <[email protected]> wrote:
> > > > > > > > Hello Justin,
> > > > > > > >
> > > > > > > > Considering that Metron uses HBase tables for storing enrichment and threatintel feeds, can we use HBase for geo enrichment as well? Or can MapDB be used for enrichment and threatintel feeds instead of HBase?
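Since Simon frames this as a caching problem and Casey's concern is the per-message cost of a scan, the caching point is worth making concrete: with a decent hit rate, the expensive lookup only happens on a miss. The sketch below is a minimal LRU cache over an arbitrary loader function, built on a plain LinkedHashMap; it is not Metron's actual caching layer, and all names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Minimal LRU cache in front of an expensive per-message lookup.
 * Illustrative only; the loader stands in for an HBase Get/Scan.
 */
class LruLookupCache<K, V> {
    private final Map<K, V> cache;
    private final Function<K, V> loader;
    private int misses = 0;

    LruLookupCache(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // Access-ordered LinkedHashMap evicts the least recently used entry.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public synchronized V get(K key) {
        V v = cache.get(key);
        if (v == null) {
            misses++;
            v = loader.apply(key); // the expensive lookup, only on a miss
            cache.put(key, v);
        }
        return v;
    }

    public synchronized int misses() { return misses; }
}
```

Whether this is "enough" depends on the IP distribution in the traffic: a long tail of distinct IPs keeps the hit rate low and the per-message scan cost Casey worries about reappears.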
> > > > > > > > - Dima
> > > > > > > >
> > > > > > > > On 01/16/2017 04:17 PM, Justin Leet wrote:
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > As a bit of background, right now, GeoIP data is loaded into and managed by MySQL (the connectors are LGPL licensed and we need to sever our Maven dependency on it before the next release). We currently depend on and install an instance of MySQL (in each of the Management Pack, Ansible, and Docker installs). In the topology, we use the JDBCAdapter to connect to MySQL and query for a given IP. Additionally, it's a single point of failure for that particular enrichment right now. If MySQL is down, geo enrichment can't occur.
> > > > > > > > >
> > > > > > > > > I'm proposing that we eliminate the use of MySQL entirely, through all installation paths (which, unless I missed some, includes Ansible, the Ambari Management Pack, and Docker). We'd do this by dropping all the various MySQL setup and management through the code, along with all the DDL, etc. The JDBCAdapter would stay, so that anybody who wants to set up their own databases for enrichments and install connectors is able to do so.
> > > > > > > > >
> > > > > > > > > In its place, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file (This is NOT a separate installation of anything, it's just a jar that manages interaction with the file system).
> > > > > > > > > Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that can be run when needed, downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file, the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make it transparent to downstream code.
> > > > > > > > >
> > > > > > > > > The particularly nice parts about using MapDB are its ease of use plus the fact that it offers the utilities we need out of the box to support the operations we need on this (Keep in mind the GeoIP files use IP ranges and we need to be able to easily grab the appropriate range).
> > > > > > > > >
> > > > > > > > > The main point of concern I have about this is that when we grab the HDFS file during an update, given that multiple JVMs can be running, we don't want them to clobber each other. I believe this can be avoided by simply using each worker's working directory to store the file (and appropriately ensure threads on the same JVM manage multithreading). This should keep the JVMs (and the underlying DB files) entirely independent.
> > > > > > > > >
> > > > > > > > > This script would get called by the various installations during startup to do the initial setup. After install, it can then be called on demand in order to update the data.
> > > > > > > > >
> > > > > > > > > At this point, we should be all set, with everything running and updatable.
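The reload pattern Justin describes (config callback fires, grab the new file, load it, swap it in) can be kept safe for concurrently reading tuple threads by building the replacement off to the side and publishing it with a single atomic write, so readers never observe a half-loaded database. A minimal sketch, with a plain Map standing in for the opened MapDB file and all names hypothetical:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of the reload-and-swap pattern: per-message lookups read whatever
 * snapshot is currently published, while the config-updated callback builds
 * the replacement off to the side and swaps it in atomically.
 */
class HotSwappableDb<K, V> {
    private final AtomicReference<Map<K, V>> current;

    HotSwappableDb(Map<K, V> initial) {
        this.current = new AtomicReference<>(initial);
    }

    /** Per-message lookups see a complete, consistent snapshot. */
    public V lookup(K key) {
        return current.get().get(key);
    }

    /** Called from the config callback; the swap is one atomic write. */
    public void reload(Map<K, V> freshlyLoaded) {
        current.set(freshlyLoaded);
        // In a real bolt we'd also close the old MapDB file once in-flight
        // readers are done with it; that bookkeeping is elided here.
    }
}
```

This is also why the per-worker working directory matters: each JVM swaps its own local file independently, and no cross-JVM coordination is needed beyond the shared config pointer.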
> > > > > > > > >
> > > > > > > > > Justin

--
Jon
Sent from my mobile device
