It is interesting - it would save us a ton of effort, and has the right license. I think it's worth at least checking out.
-D...

On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <[email protected]> wrote:

> I like that approach even more. That way we would only have to worry about distributing the database file in binary format to all the supervisor nodes on update.
>
> It would also make it easier for people to switch to the enterprise DB potentially if they had the license.
>
> One slight issue with this might be for people who wanted to extend the database. For example, organisations may want to add geo-enrichment for their own private network addresses based on modified versions of the geo database. Currently we don’t really allow this, since we hard-code ignoring private network classes into the geo enrichment adapter, but I can see a case where a global org might want to add their own ranges and locations to the data set. Does that make sense to anyone else?
>
> Simon
>
>
>> On 16 Jan 2017, at 16:50, JJ Meyer <[email protected]> wrote:
>>
>> Hello all,
>>
>> Can we leverage maxmind's Java client (https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2) in this case? I believe it can directly read the maxmind file. Plus I think it also has some support for caching as well.
>>
>> Thanks,
>> JJ
>>
>> On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <[email protected]> wrote:
>>
>>> I like the idea of MapDB, since we can essentially pull an instance into each supervisor, so it makes a lot of sense for relatively small scale, relatively static enrichments in general.
>>>
>>> Generally this feels like a caching problem, and would be for a simple key-value lookup. In that case I would agree with David Lyle on using HBase as a source of truth and relying on caching.
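Simon's note above about the adapter hard-coding the skip of private network classes can be illustrated with a minimal JDK-only sketch. This is an assumed stand-in, not Metron's actual adapter code; the class and method names are illustrative:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class PrivateRangeCheck {
    // Returns true for addresses a geo enrichment adapter might skip by
    // default: RFC 1918 site-local ranges, loopback, and link-local.
    public static boolean isNonRoutable(String ip) {
        try {
            InetAddress addr = InetAddress.getByName(ip);
            return addr.isSiteLocalAddress()
                || addr.isLoopbackAddress()
                || addr.isLinkLocalAddress();
        } catch (UnknownHostException e) {
            // Unparseable input: treat as routable and let lookup fail later.
            return false;
        }
    }
}
```

`isSiteLocalAddress()` covers 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16; an organisation that wants geo data for its own internal ranges, as Simon describes, would need a check like this to be configurable rather than hard-coded.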
>>>
>>> That said, GeoIP is a different lookup pattern, since it’s a range lookup then a key lookup (or, if we denormalize the MaxMind data, just a range lookup). For that kind of thing, MapDB with something like the BTree seems a good fit.
>>>
>>> Simon
>>>
>>>
>>>> On 16 Jan 2017, at 16:28, David Lyle <[email protected]> wrote:
>>>>
>>>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an HBase enrichment. If our current caching isn't enough to mitigate the above issues, we have a problem, don't we? Or do we not recommend HBase enrichment for per-message enrichment in general?
>>>>
>>>> Also - can you elaborate on how MapDB would not require a network hop? Doesn't this mean we would have to sync the enrichment data to each Storm supervisor? HDFS could (probably would) have a network hop too, no?
>>>>
>>>> Fwiw -
>>>> "In its place, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file (this is NOT a separate installation of anything, it's just a jar that manages interaction with the file system). Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that can be run when needed, downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file, the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make it transparent to downstream code."
>>>>
>>>> Seems a bit more complex than "refresh the hbase table". Afaik, either approach would require some sort of translation between GeoIP source format and target format, so that part is a wash imo.
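The range-then-key lookup pattern Simon describes maps naturally onto a sorted map's floor lookup: key each denormalized range by its starting IP, find the greatest start not above the query IP, then confirm the query falls inside that range. A self-contained sketch using the JDK's TreeMap (MapDB's BTreeMap offers an analogous sorted-key API; all names here are illustrative, not Metron code):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.TreeMap;

public class GeoRangeLookup {
    // Maps the start of each IP range (IPv4 as an unsigned int in a long)
    // to a pair of [range end, location label].
    private final TreeMap<Long, Map.Entry<Long, String>> ranges = new TreeMap<>();

    public void addRange(String startIp, String endIp, String location) {
        ranges.put(ipToLong(startIp), new SimpleEntry<>(ipToLong(endIp), location));
    }

    // Range lookup: greatest range start <= ip, then check ip <= range end.
    public String lookup(String ip) {
        long key = ipToLong(ip);
        Map.Entry<Long, Map.Entry<Long, String>> floor = ranges.floorEntry(key);
        if (floor != null && key <= floor.getValue().getKey()) {
            return floor.getValue().getValue();
        }
        return null; // not in any known range
    }

    static long ipToLong(String ip) {
        long result = 0;
        for (String octet : ip.split("\\.")) {
            result = (result << 8) | Integer.parseInt(octet);
        }
        return result;
    }
}
```

The same floor/ceiling idiom is what makes a BTree-backed map a better fit here than a plain key-value store, which can only answer exact-key queries.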
>>>>
>>>> So, I'd really like to see, at least, an attempt to leverage HBase enrichment.
>>>>
>>>> -D...
>>>>
>>>>
>>>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <[email protected]> wrote:
>>>>
>>>>> I think that it's a sensible thing to use MapDB for the geo enrichment. Let me state my reasoning:
>>>>>
>>>>> - An HBase implementation would necessitate an HBase scan, possibly hitting HDFS, which is expensive per-message.
>>>>> - An HBase implementation would necessitate a network hop and MapDB would not.
>>>>>
>>>>> I also think this might be the beginning of more general-purpose support in Stellar for locally shipped, read-only MapDB lookups, which might be interesting.
>>>>>
>>>>> In short, all quotes about premature optimization are sure to apply to my reasoning, but I can't help but have my spidey senses tingle when we introduce a scan-per-message architecture.
>>>>>
>>>>> Casey
>>>>>
>>>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <[email protected]> wrote:
>>>>>
>>>>>> Hello Justin,
>>>>>>
>>>>>> Considering that Metron uses hbase tables for storing enrichment and threatintel feeds, can we use Hbase for geo enrichment as well? Or can MapDB be used for enrichment and threatintel feeds instead of hbase?
>>>>>>
>>>>>> - Dima
>>>>>>
>>>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> As a bit of background, right now, GeoIP data is loaded into and managed by MySQL (the connectors are LGPL licensed and we need to sever our Maven dependency on it before the next release). We currently depend on and install an instance of MySQL (in each of the Management Pack, Ansible, and Docker installs). In the topology, we use the JDBCAdapter to connect to MySQL and query for a given IP.
>>>>>>> Additionally, it's a single point of failure for that particular enrichment right now. If MySQL is down, geo enrichment can't occur.
>>>>>>>
>>>>>>> I'm proposing that we eliminate the use of MySQL entirely, through all installation paths (which, unless I missed some, includes Ansible, the Ambari Management Pack, and Docker). We'd do this by dropping all the various MySQL setup and management through the code, along with all the DDL, etc. The JDBCAdapter would stay, so that anybody who wants to set up their own databases for enrichments and install connectors is able to do so.
>>>>>>>
>>>>>>> In its place, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file (this is NOT a separate installation of anything, it's just a jar that manages interaction with the file system). Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that can be run when needed, downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file, the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make it transparent to downstream code.
>>>>>>>
>>>>>>> The particularly nice parts about using MapDB are its ease of use, plus the fact that it offers the utilities we need out of the box to support the operations we need here (keep in mind the GeoIP files use IP ranges and we need to be able to easily grab the appropriate range).
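The "bolts get the updated config callback and can update their db files" step Justin describes amounts to atomically swapping a read-only snapshot so in-flight tuples never see a half-loaded database. A hedged JDK-only sketch of that holder pattern (a plain Map stands in for the MapDB-backed collection; names are assumed, not Metron's):

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class GeoDbHolder {
    // Readers always see one complete, immutable snapshot; the config
    // callback swaps in a fully built replacement in a single step.
    private final AtomicReference<Map<String, String>> current =
        new AtomicReference<>(Map.of());

    // Called from the config-update callback after the new file is
    // downloaded and fully loaded; never mutates the live snapshot.
    public void update(Map<String, String> freshlyLoaded) {
        current.set(freshlyLoaded);
    }

    public String lookup(String key) {
        return current.get().getOrDefault(key, null);
    }
}
```

Wrapping the lookup behind a holder like this is also what keeps the MapDB portions "transparent to downstream code": callers never learn whether the snapshot came from MapDB, HBase, or anything else.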
>>>>>>>
>>>>>>> The main point of concern I have about this is that when we grab the HDFS file during an update, given that multiple JVMs can be running, we don't want them to clobber each other. I believe this can be avoided by simply using each worker's working directory to store the file (and appropriately ensuring that threads on the same JVM manage multithreading). This should keep the JVMs (and the underlying DB files) entirely independent.
>>>>>>>
>>>>>>> This script would get called by the various installations during startup to do the initial setup. After install, it can then be called on demand.
>>>>>>>
>>>>>>> At this point, we should be all set, with everything running and updatable.
>>>>>>>
>>>>>>> Justin
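Justin's per-worker working-directory idea can be sketched with JDK file APIs. This is a simplified illustration, not Metron code: a local path stands in for the HDFS fetch, the method is synchronized so threads within one JVM serialize, and separate JVMs stay independent because each passes its own worker directory:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalDbFetcher {
    // Copy the shared DB file into a per-worker directory so concurrent
    // JVMs never write the same local path. Copy to a temp name first,
    // then move into place, so readers never observe a partial file.
    public static synchronized Path fetchToWorkerDir(Path sharedDb, Path workerDir)
            throws IOException {
        Files.createDirectories(workerDir);
        Path local = workerDir.resolve(sharedDb.getFileName().toString());
        Path tmp = workerDir.resolve(sharedDb.getFileName() + ".tmp");
        Files.copy(sharedDb, tmp, StandardCopyOption.REPLACE_EXISTING);
        Files.move(tmp, local, StandardCopyOption.REPLACE_EXISTING);
        return local;
    }
}
```

In the real topology the copy source would be an HDFS read (e.g. via the Hadoop FileSystem API), but the isolation argument is the same: clobbering is avoided by never sharing a writable local path between JVMs.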
