I'm also in agreement on this.

On Mon, Jan 16, 2017 at 2:11 PM, Nick Allen <n...@nickallen.org> wrote:

> +1 to using the Java API with the MMDB file provided by MaxMind.  This is
> what I had thought we were doing when we discussed this a few months back.
> I'd rather use the MaxMind tools as provided instead of engineering
> something on top of it.
>
> On Mon, Jan 16, 2017 at 3:59 PM, JJ Meyer <jjmey...@gmail.com> wrote:
>
> > Matt, I agree with your points on why we shouldn't get rid of the
> > database just for the sake of getting rid of a database. But IMO we may
> > be reinventing the wheel a little bit by even putting the MaxMind data
> > into MySQL. Right now we are already downloading a MaxMind file. To me it
> > seems simpler to push that file to HDFS, where we can pick it up and have
> > the MaxMind client use it, instead of importing the data into a DB and
> > then running a query. Also, I believe the data gets updated weekly, so
> > syncing may become easier too.
> >
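> > For illustration, something along these lines could work (a minimal
> > sketch, assuming the stock GeoIP2 DatabaseReader and Hadoop FileSystem
> > APIs; the HDFS path and class name are just placeholders):
> >
> > import com.maxmind.db.CHMCache;
> > import com.maxmind.geoip2.DatabaseReader;
> > import com.maxmind.geoip2.model.CityResponse;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import java.io.InputStream;
> > import java.net.InetAddress;
> >
> > public class GeoFromHdfsSketch {
> >   public static void main(String[] args) throws Exception {
> >     // Open the MMDB file that was pushed to HDFS (placeholder path).
> >     FileSystem fs = FileSystem.get(new Configuration());
> >     try (InputStream in = fs.open(new Path("/apps/metron/geo/GeoLite2-City.mmdb"))) {
> >       // Build a reader straight from the stream, with MaxMind's simple cache.
> >       DatabaseReader reader = new DatabaseReader.Builder(in)
> >           .withCache(new CHMCache())
> >           .build();
> >       CityResponse response = reader.city(InetAddress.getByName("8.8.8.8"));
> >       System.out.println(response.getCountry().getName() + " / "
> >           + response.getCity().getName());
> >     }
> >   }
> > }
> >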
> > James, I believe it works with both the paid and free versions of GeoIP.
> > I know NiFi uses this client library in its Geo enrichment processor.
> >
> > Also, if it is decided that using a SQL database is still the best
> > solution, I think there is a benefit to using their library. We would
> > just have to implement a `DatabaseProvider` that hits a SQL DB instead
> > of using their standard implementation.
> >
> > Thanks,
> > JJ
> >
> > On Mon, Jan 16, 2017 at 2:27 PM, James Sirota <jsir...@apache.org> wrote:
> >
> > > Hi Guys, I just wanted to clarify one point that I think is lost in
> > > this thread.  Geo enrichment is NOT a key-value enrichment.  It
> > > requires a range scan and a join (which is why it's implemented via
> > > MySQL and not HBase).  To account for this access pattern via a
> > > key-value store you would inevitably have to do something funky, and
> > > in the case of HBase I don't think there is a way to avoid doing a
> > > range scan.
> > >
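> > > To make that access pattern concrete, here is a rough JDBC sketch; the
> > > table and column names are only assumptions to illustrate the range
> > > scan plus join, not our actual schema:
> > >
> > > import java.net.InetAddress;
> > > import java.nio.ByteBuffer;
> > > import java.sql.Connection;
> > > import java.sql.DriverManager;
> > > import java.sql.PreparedStatement;
> > > import java.sql.ResultSet;
> > >
> > > public class GeoRangeJoinSketch {
> > >   // Dotted-quad IPv4 address to its unsigned 32-bit value.
> > >   static long ipToLong(String ip) throws Exception {
> > >     byte[] b = InetAddress.getByName(ip).getAddress();
> > >     return ByteBuffer.wrap(new byte[]{0, 0, 0, 0, b[0], b[1], b[2], b[3]}).getLong();
> > >   }
> > >
> > >   public static void main(String[] args) throws Exception {
> > >     long ip = ipToLong("203.0.113.7");
> > >     // Range scan over the blocks table, then a join out to the locations table.
> > >     String sql = "SELECT l.country, l.city "
> > >                + "FROM geo_blocks b JOIN geo_locations l ON b.loc_id = l.loc_id "
> > >                + "WHERE ? BETWEEN b.start_ip AND b.end_ip";
> > >     try (Connection conn = DriverManager.getConnection(args[0]);
> > >          PreparedStatement ps = conn.prepareStatement(sql)) {
> > >       ps.setLong(1, ip);
> > >       try (ResultSet rs = ps.executeQuery()) {
> > >         if (rs.next()) {
> > >           System.out.println(rs.getString("country") + " / " + rs.getString("city"));
> > >         }
> > >       }
> > >     }
> > >   }
> > > }
> > >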
> > > With respect to MapDB, it only has support for Maps, Sets, Lists, and
> > > Queues.  Are we sure it provides enough functionality for us to do
> > > this enrichment?
> > >
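> > > For what it's worth, a range lookup could in principle be built on its
> > > BTreeMap. A minimal sketch against the MapDB 3.x API, with made-up
> > > range and location values:
> > >
> > > import org.mapdb.BTreeMap;
> > > import org.mapdb.DB;
> > > import org.mapdb.DBMaker;
> > > import org.mapdb.Serializer;
> > > import java.util.Map;
> > >
> > > public class GeoMapDbSketch {
> > >   public static void main(String[] args) {
> > >     // File-backed collections; just a jar, no separate server to run.
> > >     DB db = DBMaker.fileDB("geo.mapdb").make();
> > >     // Key = numeric start of an IP range, value = denormalized location record.
> > >     BTreeMap<Long, String> geo =
> > >         db.treeMap("geo", Serializer.LONG, Serializer.STRING).createOrOpen();
> > >     geo.put(3405803776L, "3405804031|US|MN|Minneapolis"); // 203.0.113.0/24, made up
> > >
> > >     long ip = 3405803783L; // 203.0.113.7
> > >     // Greatest range start <= ip; a real lookup would also check the range end.
> > >     Map.Entry<Long, String> hit = geo.floorEntry(ip);
> > >     System.out.println(hit == null ? "no match" : hit.getValue());
> > >     db.close();
> > >   }
> > > }
> > >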
> > > With respect to the MaxMind client, are we sure we can use it on the
> > > MySQL-backed version of their DB?  I thought the MaxMind database
> > > itself is proprietary and is something you have to pay for.  My
> > > understanding is that the client is designed for that proprietary
> > > version.
> > >
> > > I somewhat agree with Matt's point.  If MySQL is a problem because of
> > > licensing, the path of least resistance to removing the MySQL
> > > dependencies would be to simply switch to PostgreSQL.  We will always
> > > have conventional SQL databases in our stack because other big data
> > > tools use them.  Why not take advantage of them too?
> > >
> > > Thanks,
> > > James
> > >
> > > 16.01.2017, 12:27, "Matt Foley" <ma...@apache.org>:
> > > > Hi Justin, and team,
> > > > Several components of the Hadoop stack utilize a SQL database,
> > > > usually for metadata of some sort. Ambari knows this and arranges for
> > > > them to share a single database installation (on or off the cluster),
> > > > unless they explicitly configure use of different databases (which is
> > > > allowed for sites that desire it). Ambari defaults to using
> > > > PostgreSQL, although it's happy to use MySQL, Oracle, or Microsoft,
> > > > along with whatever each component historically defined as its
> > > > default (such as Derby).
> > > >
> > > > If we want to start with a replacement of current functionality, I
> > > > would suggest switching the default database to PostgreSQL. Replacing
> > > > fast, efficient, and proven DB services with a file-based API library
> > > > (but with no standard way to propagate the underlying storage files)
> > > > seems to me to be taking a step backwards.
> > > >
> > > > Sticking with a SQL-based service will surely minimize the amount of
> > > > code change needed. And making the SQL either dialect-independent or
> > > > capable of switching among dialects then enables us to do what the
> > > > rest of the Hadoop stack does: allow enterprise customers to
> > > > substitute Oracle or Microsoft enterprise-class databases where they
> > > > wish. Regarding the drivers, we should study what the other stack
> > > > components do; I'm not an expert in those areas.
> > > >
> > > > Using the same DB as the rest of the stack also means administrators
> > > > can be confident they've set up adequate backup and recovery
> > > > processes.
> > > >
> > > > All these are valuable reasons not to roll our own storage system for
> > > > this enrichment data. IMO, of course.
> > > >
> > > > Cheers,
> > > > --Matt
> > > >
> > > > On 1/16/17, 9:52 AM, "Kyle Richardson" <kylerichards...@gmail.com> wrote:
> > > >
> > > >     +1 Agree with David's order
> > > >
> > > >     -Kyle
> > > >
> > > >     On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <dlyle65...@gmail.com> wrote:
> > > >
> > > >     > Def agree on the parity point.
> > > >     >
> > > >     > I'm a little worried about Supervisor relocations for non-HBase
> > > >     > solutions, but having much of the work done for us by MaxMind
> > > >     > changes my preference to (in order):
> > > >     >
> > > >     > 1) MM API
> > > >     > 2) HBase Enrichment
> > > >     > 3) MapDB, should the others prove not feasible
> > > >     >
> > > >     >
> > > >     > -D...
> > > >     >
> > > >     >
> > > >     > On Mon, Jan 16, 2017 at 12:15 PM, Justin Leet <justinjl...@gmail.com> wrote:
> > > >     >
> > > >     > > I definitely agree on checking out the MaxMind API. I'll take
> > > >     > > a look at it, but at first glance it looks like it does
> > > >     > > include everything we use. Great find, JJ.
> > > >     > >
> > > >     > > More details on various people's points:
> > > >     > >
> > > >     > > As a note to anyone hopping in, Simon's point on the range
> > > >     > > lookup vs a key lookup is why it becomes a Scan in HBase vs a
> > > >     > > Get. As an addendum to what Simon mentioned, denormalizing is
> > > >     > > easy enough and turns it into an easy range lookup.
> > > >     > >
> > > >     > > To David's point, the MapDB approach does require a network
> > > >     > > hop, but it's once per refresh of the data (got a relevant
> > > >     > > callback? grab new data, load it, swap it out) instead of (up
> > > >     > > to) once per message. I would expect the same to be true of
> > > >     > > the MaxMind DB files.
> > > >     > >
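> > > >     > > For concreteness, the swap-on-callback pattern I have in mind
> > > >     > > is roughly the sketch below (the class and method names are
> > > >     > > placeholders, not existing Metron code):
> > > >     > >
> > > >     > > import java.util.Map;
> > > >     > > import java.util.concurrent.atomic.AtomicReference;
> > > >     > >
> > > >     > > public class GeoLookupHolder {
> > > >     > >   // The currently loaded, read-only lookup; readers never block on a reload.
> > > >     > >   private final AtomicReference<Map<Long, String>> current = new AtomicReference<>();
> > > >     > >
> > > >     > >   // Called from the config-update callback: build the new lookup
> > > >     > >   // off to the side, then swap it in atomically.
> > > >     > >   public void onConfigUpdate(String newFilePath) {
> > > >     > >     current.set(loadFromFile(newFilePath));
> > > >     > >   }
> > > >     > >
> > > >     > >   public String lookup(long ip) {
> > > >     > >     Map<Long, String> snapshot = current.get();
> > > >     > >     return snapshot == null ? null : snapshot.get(ip);
> > > >     > >   }
> > > >     > >
> > > >     > >   private Map<Long, String> loadFromFile(String path) {
> > > >     > >     // Placeholder: in practice this would pull the new file down
> > > >     > >     // from HDFS and open the MapDB- (or MMDB-) backed lookup.
> > > >     > >     return java.util.Collections.emptyMap();
> > > >     > >   }
> > > >     > > }
> > > >     > >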
> > > >     > > I'd also argue MapDB is not really more complex than
> > > >     > > refreshing the HBase table, because there we potentially have
> > > >     > > to start worrying about things like hashing and/or indices and
> > > >     > > even just general data representation. It's definitely correct
> > > >     > > that the file processing has to occur on either path, so it
> > > >     > > really boils down to handling the callback and reloading the
> > > >     > > file vs handling some of the standard HBasey things. I don't
> > > >     > > think either is an enormous amount of work (and both are
> > > >     > > almost certainly more work than MaxMind's API).
> > > >     > >
> > > >     > > Regarding extensibility, I'd argue for parity with what we
> > > >     > > have first, then build what we need from there. Does anybody
> > > >     > > have any disagreement with that approach for right now?
> > > >     > >
> > > >     > > Justin
> > > >     > >
> > > >     > > On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <dlyle65...@gmail.com> wrote:
> > > >     > >
> > > >     > > > It is interesting - it would save us a ton of effort, and
> > > >     > > > it has the right license. I think it's worth at least
> > > >     > > > checking out.
> > > >     > > >
> > > >     > > > -D...
> > > >     > > >
> > > >     > > >
> > > >     > > > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <si...@simonellistonball.com> wrote:
> > > >     > > >
> > > >     > > > > I like that approach even more. That way we would only
> > > >     > > > > have to worry about distributing the database file in
> > > >     > > > > binary format to all the supervisor nodes on update.
> > > >     > > > >
> > > >     > > > > It would also potentially make it easier for people to
> > > >     > > > > switch to the enterprise DB if they had the license.
> > > >     > > > >
> > > >     > > > > One slight issue with this might be for people who want
> > > >     > > > > to extend the database. For example, organisations may
> > > >     > > > > want to add geo-enrichment for their own private network
> > > >     > > > > addresses based on modified versions of the geo database.
> > > >     > > > > Currently we don't really allow this, since we hard-code
> > > >     > > > > ignoring private network classes into the geo enrichment
> > > >     > > > > adapter, but I can see a case where a global org might
> > > >     > > > > want to add their own ranges and locations to the data
> > > >     > > > > set. Does that make sense to anyone else?
> > > >     > > > >
> > > >     > > > > Simon
> > > >     > > > >
> > > >     > > > >
> > > >     > > > > > On 16 Jan 2017, at 16:50, JJ Meyer <jjmey...@gmail.com> wrote:
> > > >     > > > > >
> > > >     > > > > > Hello all,
> > > >     > > > > >
> > > >     > > > > > Can we leverage MaxMind's Java client (
> > > >     > > > > > https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2)
> > > >     > > > > > in this case? I believe it can read the MaxMind file
> > > >     > > > > > directly, and I think it also has some support for
> > > >     > > > > > caching.
> > > >     > > > > >
> > > >     > > > > > Thanks,
> > > >     > > > > > JJ
> > > >     > > > > >
> > > >     > > > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <si...@simonellistonball.com> wrote:
> > > >     > > > > >
> > > >     > > > > >> I like the idea of MapDB, since we can essentially pull
> > > >     > > > > >> an instance into each supervisor, so it makes a lot of
> > > >     > > > > >> sense for relatively small scale, relatively static
> > > >     > > > > >> enrichments in general.
> > > >     > > > > >>
> > > >     > > > > >> Generally this feels like a caching problem, and would
> > > >     > > > > >> be for a simple key-value lookup. In that case I would
> > > >     > > > > >> agree with David Lyle on using HBase as a source of
> > > >     > > > > >> truth and relying on caching.
> > > >     > > > > >>
> > > >     > > > > >> That said, GeoIP is a different lookup pattern, since
> > > >     > > > > >> it's a range lookup then a key lookup (or, if we
> > > >     > > > > >> denormalize the MaxMind data, just a range lookup). For
> > > >     > > > > >> that kind of thing, MapDB with something like the BTree
> > > >     > > > > >> seems a good fit.
> > > >     > > > > >>
> > > >     > > > > >> Simon
> > > >     > > > > >>
> > > >     > > > > >>
> > > >     > > > > >>> On 16 Jan 2017, at 16:28, David Lyle <dlyle65...@gmail.com> wrote:
> > > >     > > > > >>>
> > > >     > > > > >>> I'm +1 on removing the MySQL dependency, BUT - I'd
> > > >     > > > > >>> prefer to see it as an HBase enrichment. If our current
> > > >     > > > > >>> caching isn't enough to mitigate the above issues, we
> > > >     > > > > >>> have a problem, don't we? Or do we not recommend HBase
> > > >     > > > > >>> enrichment for per-message enrichment in general?
> > > >     > > > > >>>
> > > >     > > > > >>> Also - can you elaborate on how MapDB would not
> > > >     > > > > >>> require a network hop? Doesn't this mean we would have
> > > >     > > > > >>> to sync the enrichment data to each Storm supervisor?
> > > >     > > > > >>> HDFS could (probably would) have a network hop too, no?
> > > >     > > > > >>>
> > > >     > > > > >>> Fwiw -
> > > >     > > > > >>> "In its place, I've looked at using MapDB, which is a
> > > >     > > > > >>> really easy to use library for creating Java
> > > >     > > > > >>> collections backed by a file (This is NOT a separate
> > > >     > > > > >>> installation of anything, it's just a jar that manages
> > > >     > > > > >>> interaction with the file system). Given the slow churn
> > > >     > > > > >>> of the GeoIP files (I believe they get updated once a
> > > >     > > > > >>> week), we can have a script that can be run when
> > > >     > > > > >>> needed, downloads the MaxMind tar file, builds the
> > > >     > > > > >>> MapDB file that will be used by the bolts, and places
> > > >     > > > > >>> it into HDFS. Finally, we update a config to point to
> > > >     > > > > >>> the new file, the bolts get the updated config callback
> > > >     > > > > >>> and can update their db files. Inside the code, we wrap
> > > >     > > > > >>> the MapDB portions to make it transparent to downstream
> > > >     > > > > >>> code."
> > > >     > > > > >>>
> > > >     > > > > >>> Seems a bit more complex than "refresh the HBase
> > > >     > > > > >>> table". Afaik, either approach would require some sort
> > > >     > > > > >>> of translation between the GeoIP source format and the
> > > >     > > > > >>> target format, so that part is a wash imo.
> > > >     > > > > >>>
> > > >     > > > > >>> So, I'd really like to see, at least, an attempt to
> > > >     > > > > >>> leverage HBase enrichment.
> > > >     > > > > >>>
> > > >     > > > > >>> -D...
> > > >     > > > > >>>
> > > >     > > > > >>>
> > > >     > > > > >>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <ceste...@gmail.com> wrote:
> > > >     > > > > >>>
> > > >     > > > > >>>> I think that it's a sensible thing to use MapDB for
> > > >     > > > > >>>> the geo enrichment. Let me state my reasoning:
> > > >     > > > > >>>>
> > > >     > > > > >>>> - An HBase implementation would necessitate an HBase
> > > >     > > > > >>>>   scan, possibly hitting HDFS, which is expensive
> > > >     > > > > >>>>   per-message.
> > > >     > > > > >>>> - An HBase implementation would necessitate a network
> > > >     > > > > >>>>   hop, and MapDB would not.
> > > >     > > > > >>>>
> > > >     > > > > >>>> I also think this might be the beginning of more
> > > >     > > > > >>>> general-purpose support in Stellar for locally
> > > >     > > > > >>>> shipped, read-only MapDB lookups, which might be
> > > >     > > > > >>>> interesting.
> > > >     > > > > >>>>
> > > >     > > > > >>>> In short, all quotes about premature optimization are
> > > >     > > > > >>>> sure to apply to my reasoning, but I can't help but
> > > >     > > > > >>>> have my spidey senses tingle when we introduce a
> > > >     > > > > >>>> scan-per-message architecture.
> > > >     > > > > >>>>
> > > >     > > > > >>>> Casey
> > > >     > > > > >>>>
> > > >     > > > > >>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <dima.koval...@sstech.us> wrote:
> > > >     > > > > >>>>
> > > >     > > > > >>>>> Hello Justin,
> > > >     > > > > >>>>>
> > > >     > > > > >>>>> Considering that Metron uses HBase tables for
> > > >     > > > > >>>>> storing enrichment and threatintel feeds, can we use
> > > >     > > > > >>>>> HBase for geo enrichment as well? Or can MapDB be
> > > >     > > > > >>>>> used for enrichment and threatintel feeds instead of
> > > >     > > > > >>>>> HBase?
> > > >     > > > > >>>>>
> > > >     > > > > >>>>> - Dima
> > > >     > > > > >>>>>
> > > >     > > > > >>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
> > > >     > > > > >>>>>> Hi all,
> > > >     > > > > >>>>>>
> > > >     > > > > >>>>>> As a bit of background, right now GeoIP data is
> > > >     > > > > >>>>>> loaded into and managed by MySQL (the connectors are
> > > >     > > > > >>>>>> LGPL licensed and we need to sever our Maven
> > > >     > > > > >>>>>> dependency on it before the next release). We
> > > >     > > > > >>>>>> currently depend on and install an instance of MySQL
> > > >     > > > > >>>>>> (in each of the Management Pack, Ansible, and Docker
> > > >     > > > > >>>>>> installs). In the topology, we use the JDBCAdapter
> > > >     > > > > >>>>>> to connect to MySQL and query for a given IP.
> > > >     > > > > >>>>>> Additionally, it's a single point of failure for
> > > >     > > > > >>>>>> that particular enrichment right now: if MySQL is
> > > >     > > > > >>>>>> down, geo enrichment can't occur.
> > > >     > > > > >>>>>>
> > > >     > > > > >>>>>> I'm proposing that we eliminate the use of MySQL
> > > >     > > > > >>>>>> entirely, through all installation paths (which,
> > > >     > > > > >>>>>> unless I missed some, includes Ansible, the Ambari
> > > >     > > > > >>>>>> Management Pack, and Docker). We'd do this by
> > > >     > > > > >>>>>> dropping all the various MySQL setup and management
> > > >     > > > > >>>>>> throughout the code, along with all the DDL, etc.
> > > >     > > > > >>>>>> The JDBCAdapter would stay, so that anybody who
> > > >     > > > > >>>>>> wants to set up their own databases for enrichments
> > > >     > > > > >>>>>> and install connectors is able to do so.
> > > >     > > > > >>>>>>
> > > >     > > > > >>>>>> In its place, I've looked at using MapDB, which is
> > > >     > > > > >>>>>> a really easy-to-use library for creating Java
> > > >     > > > > >>>>>> collections backed by a file (this is NOT a separate
> > > >     > > > > >>>>>> installation of anything, it's just a jar that
> > > >     > > > > >>>>>> manages interaction with the file system). Given the
> > > >     > > > > >>>>>> slow churn of the GeoIP files (I believe they get
> > > >     > > > > >>>>>> updated once a week), we can have a script that can
> > > >     > > > > >>>>>> be run when needed, which downloads the MaxMind tar
> > > >     > > > > >>>>>> file, builds the MapDB file that will be used by the
> > > >     > > > > >>>>>> bolts, and places it into HDFS. Finally, we update a
> > > >     > > > > >>>>>> config to point to the new file, the bolts get the
> > > >     > > > > >>>>>> updated config callback, and they can update their
> > > >     > > > > >>>>>> DB files. Inside the code, we wrap the MapDB
> > > >     > > > > >>>>>> portions to make it transparent to downstream code.
> > > >     > > > > >>>>>>
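> > > >     > > > > >>>>>> Roughly, the build-and-publish step could look like
> > > >     > > > > >>>>>> the sketch below (the CSV parsing is elided and the
> > > >     > > > > >>>>>> paths and names are placeholders, not a final
> > > >     > > > > >>>>>> design):
> > > >     > > > > >>>>>>
> > > >     > > > > >>>>>> import org.apache.hadoop.conf.Configuration;
> > > >     > > > > >>>>>> import org.apache.hadoop.fs.FileSystem;
> > > >     > > > > >>>>>> import org.apache.hadoop.fs.Path;
> > > >     > > > > >>>>>> import org.mapdb.BTreeMap;
> > > >     > > > > >>>>>> import org.mapdb.DB;
> > > >     > > > > >>>>>> import org.mapdb.DBMaker;
> > > >     > > > > >>>>>> import org.mapdb.Serializer;
> > > >     > > > > >>>>>>
> > > >     > > > > >>>>>> public class GeoDbPublisher {
> > > >     > > > > >>>>>>   public static void main(String[] args) throws Exception {
> > > >     > > > > >>>>>>     String localFile = "/tmp/geo.mapdb";
> > > >     > > > > >>>>>>
> > > >     > > > > >>>>>>     // 1. Build the file-backed map from the downloaded MaxMind data.
> > > >     > > > > >>>>>>     DB db = DBMaker.fileDB(localFile).make();
> > > >     > > > > >>>>>>     BTreeMap<Long, String> geo =
> > > >     > > > > >>>>>>         db.treeMap("geo", Serializer.LONG, Serializer.STRING).createOrOpen();
> > > >     > > > > >>>>>>     // ... parse the MaxMind CSVs here and geo.put(rangeStart, record) ...
> > > >     > > > > >>>>>>     db.close();
> > > >     > > > > >>>>>>
> > > >     > > > > >>>>>>     // 2. Push the finished file to HDFS for the bolts to pick up.
> > > >     > > > > >>>>>>     FileSystem fs = FileSystem.get(new Configuration());
> > > >     > > > > >>>>>>     fs.copyFromLocalFile(new Path(localFile), new Path("/apps/metron/geo/geo.mapdb"));
> > > >     > > > > >>>>>>   }
> > > >     > > > > >>>>>> }
> > > >     > > > > >>>>>>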
> > > >     > > > > >>>>>> The particularly nice parts about using MapDB are
> > > >     > > > > >>>>>> its ease of use, plus the fact that it offers the
> > > >     > > > > >>>>>> utilities we need out of the box to support the
> > > >     > > > > >>>>>> operations we need on this (keep in mind the GeoIP
> > > >     > > > > >>>>>> files use IP ranges, and we need to be able to
> > > >     > > > > >>>>>> easily grab the appropriate range).
> > > >     > > > > >>>>>>
> > > >     > > > > >>>>>> The main point of concern I have about this is that
> > > >     > > > > >>>>>> when we grab the HDFS file during an update, given
> > > >     > > > > >>>>>> that multiple JVMs can be running, we don't want
> > > >     > > > > >>>>>> them to clobber each other. I believe this can be
> > > >     > > > > >>>>>> avoided by simply using each worker's working
> > > >     > > > > >>>>>> directory to store the file (and appropriately
> > > >     > > > > >>>>>> ensuring that threads on the same JVM manage
> > > >     > > > > >>>>>> multithreading). This should keep the JVMs (and the
> > > >     > > > > >>>>>> underlying DB files) entirely independent.
> > > >     > > > > >>>>>>
> > > >     > > > > >>>>>> This script would get called by the various
> > > >     > > > > >>>>>> installations during startup to do the initial
> > > >     > > > > >>>>>> setup. After install, it can then be called on
> > > >     > > > > >>>>>> demand.
> > > >     > > > > >>>>>>
> > > >     > > > > >>>>>> At this point, we should be all set, with
> > > >     > > > > >>>>>> everything running and updatable.
> > > >     > > > > >>>>>>
> > > >     > > > > >>>>>> Justin
> > > >     > > > > >>>>>>
> > > >     > > > > >>>>>
> > > >     > > > > >>>>>
> > > >     > > > > >>>>
> > > >     > > > > >>
> > > >     > > > > >>
> > > >     > > > >
> > > >     > > > >
> > > >     > > >
> > > >     > >
> > > >     >
> > >
> > > -------------------
> > > Thank you,
> > >
> > > James Sirota
> > > PPMC- Apache Metron (Incubating)
> > > jsirota AT apache DOT org
> > >
> >
>
>
>
> --
> Nick Allen <n...@nickallen.org>
>
