It is interesting - it would save us a ton of effort, and has the right license. I think it's worth at least checking out.
-D...

On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <[email protected]> wrote:

> I like that approach even more. That way we would only have to worry about distributing the database file in binary format to all the supervisor nodes on update.
>
> It would also make it easier for people to switch to the enterprise DB potentially if they had the license.
>
> One slight issue with this might be for people who wanted to extend the database. For example, organisations may want to add geo-enrichment for their own private network addresses based on modified versions of the geo database. Currently we don’t really allow this, since we hard-code ignoring private network classes into the geo enrichment adapter, but I can see a case where a global org might want to add their own ranges and locations to the data set. Does that make sense to anyone else?
>
> Simon
>
>
>> On 16 Jan 2017, at 16:50, JJ Meyer <[email protected]> wrote:
>>
>> Hello all,
>>
>> Can we leverage maxmind's Java client (https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2) in this case? I believe it can directly read the maxmind file. Plus I think it also has some support for caching as well.
>>
>> Thanks,
>> JJ
>>
>> On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <[email protected]> wrote:
>>
>>> I like the idea of MapDB, since we can essentially pull an instance into each supervisor, so it makes a lot of sense for relatively small scale, relatively static enrichments in general.
>>>
>>> Generally this feels like a caching problem, and would be for a simple key-value lookup. In that case I would agree with David Lyle on using HBase as a source of truth and relying on caching.
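Simon's note above about the adapter hard-coding the skip of private network classes can be illustrated with a minimal JDK-only sketch. This is an assumed stand-in, not Metron's actual adapter code; the class and method names are illustrative:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class PrivateRangeCheck {
    // Returns true for addresses a geo enrichment adapter might skip by
    // default: RFC 1918 site-local ranges, loopback, and link-local.
    public static boolean isNonRoutable(String ip) {
        try {
            InetAddress addr = InetAddress.getByName(ip);
            return addr.isSiteLocalAddress()
                || addr.isLoopbackAddress()
                || addr.isLinkLocalAddress();
        } catch (UnknownHostException e) {
            // Unparseable input: treat as routable and let lookup fail later.
            return false;
        }
    }
}
```

`isSiteLocalAddress()` covers 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16; an organisation that wants geo data for its own internal ranges, as Simon describes, would need a check like this to be configurable rather than hard-coded.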
>>>
>>> That said, GeoIP is a different lookup pattern, since it’s a range lookup then a key lookup (or, if we denormalize the MaxMind data, just a range lookup). For that kind of thing, MapDB with something like the BTree seems a good fit.
>>>
>>> Simon
>>>
>>>
>>>> On 16 Jan 2017, at 16:28, David Lyle <[email protected]> wrote:
>>>>
>>>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an HBase enrichment. If our current caching isn't enough to mitigate the above issues, we have a problem, don't we? Or do we not recommend HBase enrichment for per-message enrichment in general?
>>>>
>>>> Also - can you elaborate on how MapDB would not require a network hop? Doesn't this mean we would have to sync the enrichment data to each Storm supervisor? HDFS could (probably would) have a network hop too, no?
>>>>
>>>> Fwiw -
>>>> "In its place, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file (this is NOT a separate installation of anything, it's just a jar that manages interaction with the file system). Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that can be run when needed, downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file, the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make it transparent to downstream code."
>>>>
>>>> Seems a bit more complex than "refresh the hbase table". Afaik, either approach would require some sort of translation between GeoIP source format and target format, so that part is a wash imo.
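The range-then-key lookup pattern Simon describes maps naturally onto a sorted map's floor lookup: key each denormalized range by its starting IP, find the greatest start not above the query IP, then confirm the query falls inside that range. A self-contained sketch using the JDK's TreeMap (MapDB's BTreeMap offers an analogous sorted-key API; all names here are illustrative, not Metron code):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.TreeMap;

public class GeoRangeLookup {
    // Maps the start of each IP range (IPv4 as an unsigned int in a long)
    // to a pair of [range end, location label].
    private final TreeMap<Long, Map.Entry<Long, String>> ranges = new TreeMap<>();

    public void addRange(String startIp, String endIp, String location) {
        ranges.put(ipToLong(startIp), new SimpleEntry<>(ipToLong(endIp), location));
    }

    // Range lookup: greatest range start <= ip, then check ip <= range end.
    public String lookup(String ip) {
        long key = ipToLong(ip);
        Map.Entry<Long, Map.Entry<Long, String>> floor = ranges.floorEntry(key);
        if (floor != null && key <= floor.getValue().getKey()) {
            return floor.getValue().getValue();
        }
        return null; // not in any known range
    }

    static long ipToLong(String ip) {
        long result = 0;
        for (String octet : ip.split("\\.")) {
            result = (result << 8) | Integer.parseInt(octet);
        }
        return result;
    }
}
```

The same floor/ceiling idiom is what makes a BTree-backed map a better fit here than a plain key-value store, which can only answer exact-key queries.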
>>>>
>>>> So, I'd really like to see, at least, an attempt to leverage HBase enrichment.
>>>>
>>>> -D...
>>>>
>>>>
>>>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <[email protected]> wrote:
>>>>
>>>>> I think that it's a sensible thing to use MapDB for the geo enrichment. Let me state my reasoning:
>>>>>
>>>>> - An HBase implementation would necessitate an HBase scan, possibly hitting HDFS, which is expensive per-message.
>>>>> - An HBase implementation would necessitate a network hop and MapDB would not.
>>>>>
>>>>> I also think this might be the beginning of more general-purpose support in Stellar for locally shipped, read-only MapDB lookups, which might be interesting.
>>>>>
>>>>> In short, all quotes about premature optimization are sure to apply to my reasoning, but I can't help but have my spidey senses tingle when we introduce a scan-per-message architecture.
>>>>>
>>>>> Casey
>>>>>
>>>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <[email protected]> wrote:
>>>>>
>>>>>> Hello Justin,
>>>>>>
>>>>>> Considering that Metron uses hbase tables for storing enrichment and threatintel feeds, can we use Hbase for geo enrichment as well? Or can MapDB be used for enrichment and threatintel feeds instead of hbase?
>>>>>>
>>>>>> - Dima
>>>>>>
>>>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> As a bit of background, right now, GeoIP data is loaded into and managed by MySQL (the connectors are LGPL licensed and we need to sever our Maven dependency on it before the next release). We currently depend on and install an instance of MySQL (in each of the Management Pack, Ansible, and Docker installs). In the topology, we use the JDBCAdapter to connect to MySQL and query for a given IP.
>>>>>>> Additionally, it's a single point of failure for that particular enrichment right now. If MySQL is down, geo enrichment can't occur.
>>>>>>>
>>>>>>> I'm proposing that we eliminate the use of MySQL entirely, through all installation paths (which, unless I missed some, includes Ansible, the Ambari Management Pack, and Docker). We'd do this by dropping all the various MySQL setup and management through the code, along with all the DDL, etc. The JDBCAdapter would stay, so that anybody who wants to set up their own databases for enrichments and install connectors is able to do so.
>>>>>>>
>>>>>>> In its place, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file (this is NOT a separate installation of anything, it's just a jar that manages interaction with the file system). Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that can be run when needed, downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file, the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make it transparent to downstream code.
>>>>>>>
>>>>>>> The particularly nice parts about using MapDB are its ease of use, plus the fact that it offers the utilities we need out of the box to support the operations we need here (keep in mind the GeoIP files use IP ranges and we need to be able to easily grab the appropriate range).
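The "bolts get the updated config callback and can update their db files" step Justin describes amounts to atomically swapping a read-only snapshot so in-flight tuples never see a half-loaded database. A hedged JDK-only sketch of that holder pattern (a plain Map stands in for the MapDB-backed collection; names are assumed, not Metron's):

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class GeoDbHolder {
    // Readers always see one complete, immutable snapshot; the config
    // callback swaps in a fully built replacement in a single step.
    private final AtomicReference<Map<String, String>> current =
        new AtomicReference<>(Map.of());

    // Called from the config-update callback after the new file is
    // downloaded and fully loaded; never mutates the live snapshot.
    public void update(Map<String, String> freshlyLoaded) {
        current.set(freshlyLoaded);
    }

    public String lookup(String key) {
        return current.get().getOrDefault(key, null);
    }
}
```

Wrapping the lookup behind a holder like this is also what keeps the MapDB portions "transparent to downstream code": callers never learn whether the snapshot came from MapDB, HBase, or anything else.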
>>>>>>>
>>>>>>> The main point of concern I have about this is that when we grab the HDFS file during an update, given that multiple JVMs can be running, we don't want them to clobber each other. I believe this can be avoided by simply using each worker's working directory to store the file (and appropriately ensuring that threads on the same JVM manage multithreading). This should keep the JVMs (and the underlying DB files) entirely independent.
>>>>>>>
>>>>>>> This script would get called by the various installations during startup to do the initial setup. After install, it can then be called on demand.
>>>>>>>
>>>>>>> At this point, we should be all set, with everything running and updatable.
>>>>>>>
>>>>>>> Justin
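Justin's per-worker working-directory idea can be sketched with JDK file APIs. This is a simplified illustration, not Metron code: a local path stands in for the HDFS fetch, the method is synchronized so threads within one JVM serialize, and separate JVMs stay independent because each passes its own worker directory:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalDbFetcher {
    // Copy the shared DB file into a per-worker directory so concurrent
    // JVMs never write the same local path. Copy to a temp name first,
    // then move into place, so readers never observe a partial file.
    public static synchronized Path fetchToWorkerDir(Path sharedDb, Path workerDir)
            throws IOException {
        Files.createDirectories(workerDir);
        Path local = workerDir.resolve(sharedDb.getFileName().toString());
        Path tmp = workerDir.resolve(sharedDb.getFileName() + ".tmp");
        Files.copy(sharedDb, tmp, StandardCopyOption.REPLACE_EXISTING);
        Files.move(tmp, local, StandardCopyOption.REPLACE_EXISTING);
        return local;
    }
}
```

In the real topology the copy source would be an HDFS read (e.g. via the Hadoop FileSystem API), but the isolation argument is the same: clobbering is avoided by never sharing a writable local path between JVMs.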
