It is clear to me that we need an independently stored transaction log that is decoupled from any of our existing systems. So Simon's idea of storing the transaction logs in HBase and referencing them via a global ID resonates with me. I like it for the following reasons:
- It makes Metron more pluggable, both for adding additional data storage
  sources (for example, a graph database) and for disabling existing ones.
- It makes enforcing consistency of data between data sources easier. Each
  data storage system can be pointed at the transaction log, so when a user
  modifies data in system X and the change is recorded in the transaction
  log, systems Y and Z can listen for that change and adjust their data
  accordingly, based on the global ID. A rough sketch of appending to such
  a log follows below.
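Here is that sketch, using the plain HBase client. The table name, column
family, and the reverse-timestamp qualifier scheme are all illustrative,
not a proposal for a final API:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TransactionLogWriter {
  // Illustrative table and column family names, not a final schema.
  private static final byte[] CF = Bytes.toBytes("t");

  // Append one patch to the log for a message, keyed by its global ID.
  // The qualifier is (Long.MAX_VALUE - timestamp), so a single Get on the
  // row returns the patches newest-first, with no scans involved.
  public static void logPatch(Connection conn, String guid, String patchJson)
      throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("metron_txn_log"))) {
      Put put = new Put(Bytes.toBytes(guid));
      put.addColumn(CF,
                    Bytes.toBytes(Long.MAX_VALUE - System.currentTimeMillis()),
                    patchJson.getBytes(StandardCharsets.UTF_8));
      table.put(put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      // "some-guid" stands in for a real METRON-765 global id.
      logPatch(conn, "some-guid",
               "{\"op\":\"replace\",\"field\":\"is_alert\",\"value\":true}");
    }
  }
}

Systems Y and Z would then tail this table (or a Kafka topic fed from the
same write path) and apply the patch to their own copy of the record.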
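On Justin's point below about ES only supporting the retrieve/mutate/reindex
dance: with the METRON-765 guid as the document id, that dance is small. A
minimal sketch against the ES TransportClient API; index and type names are
purely illustrative:

import java.util.Map;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.Client;

public class EsPatchService {
  // Retrieve the doc by guid, mutate one field, and reindex it under the
  // same id. Reusing the id means we update the existing doc rather than
  // creating a similar-but-unlinked sibling. index/docType are assumed
  // names; guid is the METRON-765 global id.
  public static void patch(Client client, String index, String docType,
                           String guid, String field, Object newValue) {
    GetResponse current = client.prepareGet(index, docType, guid).get();
    if (!current.isExists()) {
      throw new IllegalStateException("No document for guid " + guid);
    }
    Map<String, Object> source = current.getSource();
    source.put(field, newValue);
    client.prepareIndex(index, docType, guid).setSource(source).get();
  }
}

This is exactly the kind of thing that belongs behind the service layer
Justin describes, so nothing downstream is hardcoded to ES.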
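Casey's GUID:current / GUID:lineage row layout below also stays scan-free on
the read side, since both the optimized path and the lineage path are single
Gets. A sketch, where the "metron_updates" table and the "d" column family
are hypothetical names:

import java.util.NavigableMap;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LineageReader {
  // Hypothetical table and column family names.
  private static final byte[] CF = Bytes.toBytes("d");

  // Optimized path: one Get on the GUID:current row for the merged doc.
  public static byte[] readCurrent(Connection conn, String guid)
      throws Exception {
    try (Table t = conn.getTable(TableName.valueOf("metron_updates"))) {
      Result r = t.get(new Get(Bytes.toBytes(guid + ":current")));
      return r.getValue(CF, Bytes.toBytes("doc"));
    }
  }

  // Less-optimized path: one Get on the GUID:lineage row. HBase returns
  // columns sorted by qualifier, which is what gives us the ordered set
  // of lineage diffs.
  public static NavigableMap<byte[], byte[]> readLineage(Connection conn,
                                                         String guid)
      throws Exception {
    try (Table t = conn.getTable(TableName.valueOf("metron_updates"))) {
      Result r = t.get(new Get(Bytes.toBytes(guid + ":lineage")));
      return r.getFamilyMap(CF);
    }
  }
}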
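Finally, on Justin's HDFS worry and Simon's shc suggestion below: the batch
view could be the HDFS data left-joined against the HBase updates,
preferring the updated value where one exists. A sketch under those
assumptions; the shc catalog, table, and field names are illustrative, and
only a single mutable field is shown:

import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MergedBatchView {
  // Hypothetical shc catalog for the updates table: "guid" is the row key
  // and "is_alert" is one mutable column.
  private static final String CATALOG = "{"
      + "\"table\":{\"namespace\":\"default\",\"name\":\"metron_updates\"},"
      + "\"rowkey\":\"key\","
      + "\"columns\":{"
      + "\"guid\":{\"cf\":\"rowkey\",\"col\":\"key\",\"type\":\"string\"},"
      + "\"is_alert\":{\"cf\":\"d\",\"col\":\"is_alert\",\"type\":\"boolean\"}"
      + "}}";

  public static Dataset<Row> mergedView(SparkSession spark, String hdfsPath) {
    Dataset<Row> base = spark.read().json(hdfsPath);
    Dataset<Row> updates = spark.read()
        .format("org.apache.spark.sql.execution.datasources.hbase")
        .option("catalog", CATALOG)
        .load()
        .withColumnRenamed("is_alert", "is_alert_upd");
    // Left join on guid; rows without an update keep their original value.
    return base
        .join(updates, base.col("guid").equalTo(updates.col("guid")),
              "left_outer")
        .withColumn("is_alert", coalesce(col("is_alert_upd"), col("is_alert")))
        .drop(updates.col("guid"))
        .drop("is_alert_upd");
  }
}

This is the part I would expect to hide behind the transparent dataframe /
Hive SerDe layer Casey mentions, so end users never see the join.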
Thanks,
James

22.06.2017, 14:09, "Justin Leet" <[email protected]>:
> Thanks, Jon, that looks like it should work for the key. I didn't realize
> that guid got handled that way, which makes life much easier there. Almost
> like we already needed to identify messages or something. At that point we
> should be good, since we can easily retrieve, update, and put on it.
>
> We'll also need to make sure any long-term storage solution uses it.
>
> On Thu, Jun 22, 2017 at 12:52 PM, [email protected] <[email protected]> wrote:
>
>> The key should be a solved problem as of METRON-765
>> <https://github.com/apache/metron/commit/27b0d6e31de94317b085766349a892395f0d3309>,
>> right? It provides a single key for a given message that is globally
>> stored with the message, regardless of where/how.
>>
>> Jon
>>
>> On Thu, Jun 22, 2017 at 9:01 AM Justin Leet <[email protected]> wrote:
>>
>> > First off, I agree with the characteristics.
>> >
>> > For the data stores, we'll need to make sure we can actually handle the
>> > collapsing of the updates into a single view. Casey mentioned making
>> > the long-term stores transparent, but there's potentially work for the
>> > near real-time stores: we need to make sure we actually do updates,
>> > rather than create new docs that aren't linked to the old ones. This
>> > should be entirely transparent and handled by a service layer, rather
>> > than anything hardcoded to a datastore.
>> >
>> > For ES at least, the only way to do this is to retrieve the doc, mutate
>> > it, and then reindex it (even the update API does that dance under the
>> > hood for you, and since we're potentially doing non-trivial changes we
>> > might need to manage it ourselves). This implies the existence of a
>> > key, even if one isn't enforced by ES (which I don't believe it will
>> > be). We need to be able to grab the doc(s?) to be updated, not end up
>> > with similar ones that shouldn't be mutated. I assume this is also true
>> > (at least in the generalities) of Solr.
>> >
>> > In concert with your other thread, couldn't part of this key end up
>> > being metadata (either user-defined or environment-defined)? For
>> > example, in a situation where customer id is applied as metadata, it's
>> > possible that two customers feed off the same datasource but need to
>> > mutate it independently. At that point, we have metadata that is
>> > effectively part of the key. We don't want to update both docs, but
>> > there's no real way to distinguish them. Maybe that's something we push
>> > off for the short term, but it seems potentially nontrivial.
>> >
>> > In terms of consistency, I'd definitely agree that the long-term
>> > storage can be eventually consistent. Any type of bulk spelunking,
>> > Spark jobs, dashboarding, etc. shouldn't need up-to-the-millisecond
>> > data.
>> >
>> > Basically, I'm thinking the real-time store is the snapshot of current
>> > state, and the long-term store is the full record, complete with the
>> > lineage history.
>> >
>> > I'm also interested in people's opinions on how we want to manage HDFS.
>> > Assuming we do use HBase to store our updates, that means every HDFS op
>> > has to join onto that HBase table to get any updates that HDFS is
>> > missing (unless we implement some writeback and merge for HDFS data).
>> > I'm worried that our two datastores are really ES and HDFS+HBase, and
>> > that keeping that data actually synced for end users is going to be
>> > painful.
>> >
>> > Justin
>> >
>> > On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <
>> > [email protected]> wrote:
>> >
>> > > I'd say that was an excellent set of requirements (very similar to
>> > > the one we arrived at in the last discuss thread on this).
>> > >
>> > > My vote remains a transaction log in HBase. Given the relatively low
>> > > volume (human scale), I would not expect this to need anything fancy
>> > > like compaction into HDFS state, but that does make a good argument
>> > > for a long-term dataframe solution for Spark, with a short-term
>> > > stopgap using a joined data frame and shc.
>> > >
>> > > Simon
>> > >
>> > > Sent from my iPhone
>> > >
>> > > > On 22 Jun 2017, at 05:11, Otto Fowler <[email protected]> wrote:
>> > > >
>> > > > Can you clarify what data stores are at play here?
>> > > >
>> > > > On June 21, 2017 at 17:07:42, Casey Stella ([email protected]) wrote:
>> > > >
>> > > > Hi All,
>> > > >
>> > > > I know we've had a couple of these already, but we're due for
>> > > > another discussion of a sensible approach to mutating indexed data.
>> > > > The motivation for this is that users will want to update fields to
>> > > > correct and augment data. These corrections are invaluable for
>> > > > things like feedback for ML models, or just plain providing better
>> > > > context when evaluating alerts, etc.
>> > > >
>> > > > Rather than posing a solution, I'd like to pose the characteristics
>> > > > of a solution and we can fight about those first. ;)
>> > > >
>> > > > In my mind, the following are the characteristics that I'd look for:
>> > > >
>> > > > - Changes should be considered additional or replacement fields for
>> > > >   existing fields
>> > > > - Changes need to be available in the web view in near real time
>> > > >   (on the order of milliseconds)
>> > > > - Changes should be available in the batch view
>> > > >   - I'd be ok with eventually consistent with the web view; thoughts?
>> > > > - Changes should have lineage preserved
>> > > >   - Current value is the optimized path
>> > > >   - Lineage search is the less optimized path
>> > > > - If HBase is part of a solution
>> > > >   - maintain a scan-free solution
>> > > >   - maintain a coprocessor-free solution
>> > > >
>> > > > Most of what I've thought of is something along these lines:
>> > > >
>> > > > - Diffs are stored in columns in HBase row(s)
>> > > >   - row GUID:current would have one column with the current
>> > > >     representation
>> > > >   - row GUID:lineage would have an ordered set of columns
>> > > >     representing the lineage diffs
>> > > > - Mutable indices are directly updated (e.g. Solr or ES)
>> > > > - We'd probably want to provide transparent read support downstream
>> > > >   which supports merging for batch read:
>> > > >   - a Spark dataframe
>> > > >   - a Hive SerDe
>> > > >
>> > > > What I'd like to get out of this discussion is an architecture
>> > > > document with a suggested approach and the necessary JIRAs to split
>> > > > this up.
>> > > > If anyone has suggestions or comments about any of this, please
>> > > > speak up. I'd like to actually get this done in the near-term. :)
>> > > >
>> > > > Best,
>> > > >
>> > > > Casey
>>
>> --
>>
>> Jon

-------------------
Thank you,

James Sirota
PPMC - Apache Metron (Incubating)
jsirota AT apache DOT org
