The key should be a solved problem as of METRON-765 <https://github.com/apache/metron/commit/27b0d6e31de94317b085766349a892395f0d3309>, right? It provides a single key for a given message that is globally stored with the message, regardless of where or how the message is stored.
Jon

On Thu, Jun 22, 2017 at 9:01 AM Justin Leet <[email protected]> wrote:

> First off, I agree with the characteristics.
>
> For the data stores, we'll need to make sure we can actually handle
> collapsing the updates into a single view. Casey mentioned making the
> long-term stores transparent, but there's potentially work for the near
> real-time stores: we need to make sure we actually do updates, rather
> than create new docs that aren't linked to the old ones. This should be
> entirely transparent and handled by a service layer, rather than anything
> hardcoded to a datastore.
>
> For ES at least, the only way to do this is to retrieve the doc, mutate
> it, and then reindex it (even the Update API does that dance under the
> hood for you, and since we're potentially making non-trivial changes we
> might need to manage it ourselves). This implies the existence of a key,
> even if one isn't enforced by ES (which I don't believe it will be). We
> need to be able to grab the doc(s) to be updated, not end up with similar
> ones that shouldn't be mutated. I assume this is also true (at least in
> its generalities) of Solr as well.
>
> In concert with your other thread, couldn't part of this key end up being
> metadata (either user-defined or environment-defined)? For example, in a
> situation where a customer id is applied as metadata, it's possible two
> customers feed off the same data source but need to mutate independently.
> At this point, we have metadata that is effectively keyed. We don't want
> to update both docs, but there's no real way to distinguish them. Maybe
> that's something we push off for the short term, but it seems potentially
> nontrivial.
>
> In terms of consistency, I'd definitely agree that the long-term storage
> can be eventually consistent. Any kind of bulk spelunking, Spark jobs,
> dashboarding, etc. shouldn't need up-to-the-millisecond data.
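The retrieve/mutate/reindex dance Justin describes can be sketched roughly as below. This is an illustrative sketch only, not Metron code: the in-memory `FakeIndex` stands in for an ES/Solr index keyed by the METRON-765 GUID, and all names and fields are hypothetical.

```python
# Illustrative sketch of the retrieve -> mutate -> reindex dance.
# FakeIndex stands in for an ES/Solr index keyed by the METRON-765 GUID;
# every name here is hypothetical, not a Metron or Elasticsearch API.

class FakeIndex:
    def __init__(self):
        self._docs = {}  # guid -> document

    def index(self, guid, doc):
        # Upsert by key: the old doc is replaced, so we never end up
        # with a second, similar doc that shouldn't be mutated.
        self._docs[guid] = doc

    def get(self, guid):
        return self._docs[guid]

def apply_patch(index, guid, patch):
    """Read-modify-write: fetch the doc by its GUID, merge the patched
    fields over the existing ones, and reindex under the same key."""
    current = dict(index.get(guid))
    current.update(patch)  # additional or replacement fields
    index.index(guid, current)
    return current

idx = FakeIndex()
idx.index("guid-1", {"ip_src_addr": "10.0.0.1", "is_alert": True})
apply_patch(idx, "guid-1", {"is_alert": False, "analyst_note": "benign"})
```

Without a stable key, the final `index` call would create a new doc instead of replacing the old one, which is exactly the failure mode described above.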
>
> Basically, I'm thinking the real-time store is the snapshot of current
> state, and the long-term store is the full record, complete with the
> lineage history.
>
> I'm also interested in people's opinions on how we want to manage HDFS.
> Assuming we do use HBase to store our updates, every HDFS op has to join
> onto that HBase table to get any updates that HDFS is missing (unless we
> implement some writeback and merge for the HDFS data). I'm worried that
> our two datastores are really ES and HDFS+HBase, and that keeping that
> data actually synced for end users is going to be painful.
>
> Justin
>
> On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <
> [email protected]> wrote:
>
> > I'd say that was an excellent set of requirements (very similar to the
> > ones we arrived at in the last discuss thread on this).
> >
> > My vote remains a transaction log in HBase. Given the relatively low
> > volume (human scale), I would not expect this to need anything fancy
> > like compaction into HDFS state, but that does make a good argument for
> > a long-term dataframe solution for Spark, with a short-term stopgap
> > using a joined data frame and SHC.
> >
> > Simon
> >
> > > On 22 Jun 2017, at 05:11, Otto Fowler <[email protected]> wrote:
> > >
> > > Can you clarify which data stores are at play here?
> > >
> > > On June 21, 2017 at 17:07:42, Casey Stella ([email protected]) wrote:
> > >
> > > Hi All,
> > >
> > > I know we've had a couple of these already, but we're due for another
> > > discussion of a sensible approach to mutating indexed data. The
> > > motivation for this is that users will want to update fields to
> > > correct and augment data. These corrections are invaluable for things
> > > like feedback for ML models, or just plain providing better context
> > > when evaluating alerts, etc.
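The merge-on-read that worries Justin can be sketched as follows: every batch read of the HDFS data is joined against the HBase update table so consumers see patched values. Lists and dicts stand in for HDFS files and HBase rows here; the field names are hypothetical.

```python
# Sketch of merge-on-read over HDFS+HBase: each raw record from the
# long-term store is patched with any update recorded in HBase. Plain
# Python structures stand in for the real stores; names are hypothetical.

hdfs_records = [
    {"guid": "g1", "score": 10},
    {"guid": "g2", "score": 20},
]
hbase_updates = {"g1": {"score": 99}}  # guid -> latest collapsed patch

def merged_view(records, updates):
    """Yield each HDFS record with its HBase patch (if any) applied."""
    for rec in records:
        patched = dict(rec)
        patched.update(updates.get(rec["guid"], {}))
        yield patched

result = list(merged_view(hdfs_records, hbase_updates))
```

Every batch consumer (Spark job, Hive query) has to perform this join, which is why the thread treats the two datastores as effectively "ES" and "HDFS+HBase".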
> > >
> > > Rather than posing a solution, I'd like to pose the characteristics
> > > of a solution, and we can fight about those first. ;)
> > >
> > > In my mind, the following are the characteristics that I'd look for:
> > >
> > > - Changes should be considered additional or replacement fields for
> > >   existing fields
> > > - Changes need to be available in the web view in near real time (on
> > >   the order of milliseconds)
> > > - Changes should be available in the batch view
> > >   - I'd be ok with that being eventually consistent with the web
> > >     view; thoughts?
> > > - Changes should have their lineage preserved
> > >   - The current value is the optimized path
> > >   - Lineage search is the less optimized path
> > > - If HBase is part of a solution:
> > >   - maintain a scan-free solution
> > >   - maintain a coprocessor-free solution
> > >
> > > Most of what I've thought of is something along these lines:
> > >
> > > - Diffs are stored in columns in HBase row(s)
> > >   - row GUID:current would have one column with the current
> > >     representation
> > >   - row GUID:lineage would have an ordered set of columns
> > >     representing the lineage diffs
> > > - The mutable index (e.g. Solr or ES) is updated directly
> > > - We'd probably want to provide transparent read support downstream
> > >   that supports merging for batch reads:
> > >   - a Spark dataframe
> > >   - a Hive SerDe
> > >
> > > What I'd like to get out of this discussion is an architecture
> > > document with a suggested approach and the necessary JIRAs to split
> > > this up. If anyone has suggestions or comments about any of this,
> > > please speak up. I'd like to actually get this done in the near
> > > term. :)
> > >
> > > Best,
> > >
> > > Casey

--
Jon
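Casey's proposed row layout can be sketched as below: one row holds the precomputed current representation (the optimized path), and one row holds an ordered set of lineage diffs that can be replayed to reconstruct any historical state (the less optimized path). Plain dicts stand in for the HBase rows and columns; all names are hypothetical, not a proposed schema.

```python
# Sketch of the proposed HBase layout for a single message GUID.
# Dicts/lists stand in for HBase rows and ordered columns; all names
# are hypothetical illustrations, not Metron code.

# The original indexed document, before any corrections.
base = {"ip_src_addr": "10.0.0.1", "is_alert": True}

# row GUID:lineage -- an ordered set of diffs, one per correction.
lineage = [
    {"is_alert": False},
    {"analyst_note": "false positive"},
]

def replay(base_doc, diffs):
    """Lineage path: rebuild state by replaying diffs in order.
    Replaying a prefix of the diffs yields any historical state."""
    doc = dict(base_doc)
    for diff in diffs:
        doc.update(diff)
    return doc

# row GUID:current -- the optimized path: a single precomputed column
# holding the result of the full replay, read without any scan.
current = replay(base, lineage)
```

Because `current` is precomputed on write, the common read path is a single keyed `Get` (scan-free and coprocessor-free), while lineage queries pay the replay cost only when asked.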
