Re: extensibility - I am one of those enterprise users who plan to do enrichment using their IPAM data in the next couple of months. However, since the information that I have is in a much different format compared to MaxMind's, my approach was going to be to make a completely separate HBase enricher. That also makes it easier for me to upgrade my Metron cluster in the future, as I would not be customizing a built-in.
That said, I'm game for a follow-on enhancement, but for now this should probably just be a replacement of what currently exists.

Jon

On Mon, Jan 16, 2017 at 12:15 PM Justin Leet <[email protected]> wrote:
> I definitely agree on checking out the MaxMind API. I'll take a look at it, but at first glance it looks like it does include everything we use. Great find, JJ.
>
> More details on various people's points:
>
> As a note to anyone hopping in, Simon's point on the range lookup vs a key lookup is why it becomes a Scan in HBase vs a Get. As an addendum to what Simon mentioned, denormalizing is easy enough and turns it into an easy range lookup.
>
> To David's point, the MapDB approach does require a network hop, but it's once per refresh of the data (Got a relevant callback? Grab new data, load it, swap out) instead of (up to) once per message. I would expect the same to be true of the MaxMind db files.
>
> I'd also argue MapDB is not really more complex than refreshing the HBase table, because we potentially have to start worrying about things like hashing and/or indices and even just general data representation. It's definitely correct that the file processing has to occur on either path, so it really boils down to handling the callback and reloading the file vs handling some of the standard HBasey things. I don't think either is an enormous amount of work (and both are almost certainly more work than MaxMind's API).
>
> Regarding extensibility, I'd argue for parity with what we have first, then build what we need from there. Does anybody have any disagreement with that approach for right now?
>
> Justin
>
> On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <[email protected]> wrote:
> > It is interesting- it would save us a ton of effort, and has the right license. I think it's worth at least checking out.
> >
> > -D...
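Justin's note that denormalizing turns the MaxMind data into an easy range lookup can be made concrete. The toy class below is illustrative only: the names and ranges are made up, and a plain java.util.TreeMap stands in for a MapDB BTree (or an HBase row-key layout). Once ranges are denormalized into non-overlapping [start, end] intervals keyed by their start address, each lookup becomes a single floor search instead of a scan:

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Toy sketch of the "denormalize, then do a single floor lookup" pattern
 * for range-keyed GeoIP-style data. Not Metron code; all names are made up.
 */
class GeoRangeSketch {
    static final class Range {
        final long end;
        final String label;
        Range(long end, String label) { this.end = end; this.label = label; }
    }

    // Sorted map keyed by the start of each non-overlapping IP range.
    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    public void put(long start, long end, String label) {
        ranges.put(start, new Range(end, label));
    }

    /** One floorEntry call replaces a scan: find the last range starting <= ip. */
    public String lookup(long ip) {
        Map.Entry<Long, Range> e = ranges.floorEntry(ip);
        if (e == null || ip > e.getValue().end) {
            return null; // ip falls in a gap between ranges
        }
        return e.getValue().label;
    }

    /** Pack a dotted-quad IPv4 address into a long for ordered comparison. */
    public static long toLong(String ip) {
        long v = 0;
        for (String octet : ip.split("\\.")) {
            v = (v << 8) | Integer.parseInt(octet);
        }
        return v;
    }
}
```

The same shape works whether the sorted structure lives in MapDB, in memory, or as HBase row keys; the point is that denormalization turns "which range contains this IP?" into a single ordered lookup.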
> > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <[email protected]> wrote:
> > > I like that approach even more. That way we would only have to worry about distributing the database file in binary format to all the supervisor nodes on update.
> > >
> > > It would also make it easier for people to switch to the enterprise DB potentially if they had the license.
> > >
> > > One slight issue with this might be for people who wanted to extend the database. For example, organisations may want to add geo-enrichment to their own private network addresses based on modified versions of the geo database. Currently we don’t really allow this, since we hard-code ignoring private network classes into the geo enrichment adapter, but I can see a case where a global org might want to add their own ranges and locations to the data set. Does that make sense to anyone else?
> > >
> > > Simon
> > >
> > > On 16 Jan 2017, at 16:50, JJ Meyer <[email protected]> wrote:
> > > > Hello all,
> > > >
> > > > Can we leverage maxmind's Java client (https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2) in this case? I believe it can directly read the MaxMind files. Plus I think it also has some support for caching.
> > > >
> > > > Thanks,
> > > > JJ
> > > >
> > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <[email protected]> wrote:
> > > > > I like the idea of MapDB, since we can essentially pull an instance into each supervisor, so it makes a lot of sense for relatively small scale, relatively static enrichments in general.
> > > > >
> > > > > Generally this feels like a caching problem, and would be for a simple key-value lookup. In that case I would agree with David Lyle on using HBase as a source of truth and relying on caching.
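For reference, JJ's suggestion would look roughly like the following. This sketch is based on the GeoIP2-java reader API (DatabaseReader plus the CHMCache node cache JJ alludes to); it assumes the geoip2 dependency is on the classpath and a GeoLite2 City database file has already been downloaded, so treat the file name and field accesses as illustrative, not tested:

```java
import java.io.File;
import java.net.InetAddress;

import com.maxmind.db.CHMCache;
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.model.CityResponse;

class MaxMindSketch {
    public static void main(String[] args) throws Exception {
        // Assumes the GeoLite2 City .mmdb file was downloaded separately.
        File db = new File("GeoLite2-City.mmdb");

        // withCache enables the in-memory node cache for repeated lookups.
        DatabaseReader reader = new DatabaseReader.Builder(db)
                .withCache(new CHMCache())
                .build();

        CityResponse response = reader.city(InetAddress.getByName("128.101.101.101"));
        System.out.println(response.getCountry().getIsoCode());
        System.out.println(response.getCity().getName());
        System.out.println(response.getLocation().getLatitude());
    }
}
```

If this covers the fields the current geo enrichment emits, it would remove both the MySQL dependency and the need to build our own MapDB file format.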
> > > > > That said, GeoIP is a different lookup pattern, since it’s a range lookup then a key lookup (or if we denormalize the MaxMind data, just a range lookup). For that kind of thing, MapDB with something like the BTree seems a good fit.
> > > > >
> > > > > Simon
> > > > >
> > > > > On 16 Jan 2017, at 16:28, David Lyle <[email protected]> wrote:
> > > > > > I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an HBase enrichment. If our current caching isn't enough to mitigate the above issues, we have a problem, don't we? Or do we not recommend HBase enrichment for per-message enrichment in general?
> > > > > >
> > > > > > Also- can you elaborate on how MapDB would not require a network hop? Doesn't this mean we would have to sync the enrichment data to each Storm supervisor? HDFS could (probably would) have a network hop too, no?
> > > > > >
> > > > > > Fwiw -
> > > > > > "In its place, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file (This is NOT a separate installation of anything, it's just a jar that manages interaction with the file system). Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that can be run when needed, downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file, the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make it transparent to downstream code."
> > > > > >
> > > > > > Seems a bit more complex than "refresh the hbase table".
> > > > > > Afaik, either approach would require some sort of translation between GeoIP source format and target format, so that part is a wash imo.
> > > > > >
> > > > > > So, I'd really like to see, at least, an attempt to leverage HBase enrichment.
> > > > > >
> > > > > > -D...
> > > > > >
> > > > > > On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <[email protected]> wrote:
> > > > > > > I think that it's a sensible thing to use MapDB for the geo enrichment. Let me state my reasoning:
> > > > > > >
> > > > > > > - An HBase implementation would necessitate an HBase scan possibly hitting HDFS, which is expensive per-message.
> > > > > > > - An HBase implementation would necessitate a network hop and MapDB would not.
> > > > > > >
> > > > > > > I also think this might be the beginning of more general purpose support in Stellar for locally shipped, read-only MapDB lookups, which might be interesting.
> > > > > > >
> > > > > > > In short, all quotes about premature optimization are sure to apply to my reasoning, but I can't help but have my spidey senses tingle when we introduce a scan-per-message architecture.
> > > > > > >
> > > > > > > Casey
> > > > > > >
> > > > > > > On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <[email protected]> wrote:
> > > > > > > > Hello Justin,
> > > > > > > >
> > > > > > > > Considering that Metron uses HBase tables for storing enrichment and threatintel feeds, can we use HBase for geo enrichment as well? Or can MapDB be used for enrichment and threatintel feeds instead of HBase?
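Since Simon frames this as a caching problem and Casey's concern is the per-message cost of a scan, the caching point is worth making concrete: with a decent hit rate, the expensive lookup only happens on a miss. The sketch below is a minimal LRU cache over an arbitrary loader function, built on a plain LinkedHashMap; it is not Metron's actual caching layer, and all names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Minimal LRU cache in front of an expensive per-message lookup.
 * Illustrative only; the loader stands in for an HBase Get/Scan.
 */
class LruLookupCache<K, V> {
    private final Map<K, V> cache;
    private final Function<K, V> loader;
    private int misses = 0;

    LruLookupCache(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // Access-ordered LinkedHashMap evicts the least recently used entry.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public synchronized V get(K key) {
        V v = cache.get(key);
        if (v == null) {
            misses++;
            v = loader.apply(key); // the expensive lookup, only on a miss
            cache.put(key, v);
        }
        return v;
    }

    public synchronized int misses() { return misses; }
}
```

Whether this is "enough" depends on the IP distribution in the traffic: a long tail of distinct IPs keeps the hit rate low and the per-message scan cost Casey worries about reappears.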
> > > > > > > > - Dima
> > > > > > > >
> > > > > > > > On 01/16/2017 04:17 PM, Justin Leet wrote:
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > As a bit of background, right now, GeoIP data is loaded into and managed by MySQL (the connectors are LGPL licensed and we need to sever our Maven dependency on it before the next release). We currently depend on and install an instance of MySQL (in each of the Management Pack, Ansible, and Docker installs). In the topology, we use the JDBCAdapter to connect to MySQL and query for a given IP. Additionally, it's a single point of failure for that particular enrichment right now. If MySQL is down, geo enrichment can't occur.
> > > > > > > > >
> > > > > > > > > I'm proposing that we eliminate the use of MySQL entirely, through all installation paths (which, unless I missed some, includes Ansible, the Ambari Management Pack, and Docker). We'd do this by dropping all the various MySQL setup and management through the code, along with all the DDL, etc. The JDBCAdapter would stay, so that anybody who wants to set up their own databases for enrichments and install connectors is able to do so.
> > > > > > > > >
> > > > > > > > > In its place, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file (This is NOT a separate installation of anything, it's just a jar that manages interaction with the file system).
> > > > > > > > > Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that can be run when needed, downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file, the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make it transparent to downstream code.
> > > > > > > > >
> > > > > > > > > The particularly nice parts about using MapDB are its ease of use plus the fact that it offers the utilities we need out of the box to support the operations we need on this (Keep in mind the GeoIP files use IP ranges and we need to be able to easily grab the appropriate range).
> > > > > > > > >
> > > > > > > > > The main point of concern I have about this is that when we grab the HDFS file during an update, given that multiple JVMs can be running, we don't want them to clobber each other. I believe this can be avoided by simply using each worker's working directory to store the file (and appropriately ensure threads on the same JVM manage multithreading). This should keep the JVMs (and the underlying DB files) entirely independent.
> > > > > > > > >
> > > > > > > > > This script would get called by the various installations during startup to do the initial setup. After install, it can then be called on demand in order to update the data.
> > > > > > > > >
> > > > > > > > > At this point, we should be all set, with everything running and updatable.
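The reload pattern Justin describes (config callback fires, grab the new file, load it, swap it in) can be kept safe for concurrently reading tuple threads by building the replacement off to the side and publishing it with a single atomic write, so readers never observe a half-loaded database. A minimal sketch, with a plain Map standing in for the opened MapDB file and all names hypothetical:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of the reload-and-swap pattern: per-message lookups read whatever
 * snapshot is currently published, while the config-updated callback builds
 * the replacement off to the side and swaps it in atomically.
 */
class HotSwappableDb<K, V> {
    private final AtomicReference<Map<K, V>> current;

    HotSwappableDb(Map<K, V> initial) {
        this.current = new AtomicReference<>(initial);
    }

    /** Per-message lookups see a complete, consistent snapshot. */
    public V lookup(K key) {
        return current.get().get(key);
    }

    /** Called from the config callback; the swap is one atomic write. */
    public void reload(Map<K, V> freshlyLoaded) {
        current.set(freshlyLoaded);
        // In a real bolt we'd also close the old MapDB file once in-flight
        // readers are done with it; that bookkeeping is elided here.
    }
}
```

This is also why the per-worker working directory matters: each JVM swaps its own local file independently, and no cross-JVM coordination is needed beyond the shared config pointer.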
> > > > > > > > >
> > > > > > > > > Justin

--
Jon
Sent from my mobile device
