This problem is not uncommon, I would think. This should be implemented as cleanly as possible so that it can be spun out. It would also be a candidate for a feature/collaboration/long-lived branch.
On June 26, 2017 at 12:44:44, Casey Stella ([email protected]) wrote:

When we're talking about a "transaction log", an edit could involve multiple deletes/additions, so are we proposing storing a diff to the JSON map as the representation of a particular transaction? I proposed pre-caching the current value to lessen the burden on the reader (i.e., not having to merge the transactions into the current state); what do we think of that? Also, I want to ensure we maintain a solution that is scan-free: the edits should exist as separate columns rather than separate rows in the NoSQL store. Thoughts?

On Mon, Jun 26, 2017 at 5:36 PM, James Sirota <[email protected]> wrote:
> It is clear to me that we need an independently stored transaction log
> that is decoupled from any of our existing systems. So Simon's idea of
> storing the transaction logs in HBase and being able to reference them via
> a global ID resonates with me. I like it for the following reasons:
>
> - It makes Metron more pluggable as far as adding additional sources for
>   data storage (for example, a graph database), as well as disabling
>   existing data sources.
>
> - It makes enforcing consistency of data between data sources easier.
>   Each data storage system can be pointed at the transaction log, so
>   when a user modifies data in system X and it gets recorded in the
>   transaction log, systems Y and Z can listen for this change and adjust
>   their data accordingly based on the global ID.
>
> Thanks, James
>
> 22.06.2017, 14:09, "Justin Leet" <[email protected]>:
> > Thanks, Jon, that looks like it should work for the key. I didn't realize
> > that guid got handled that way, which makes life much easier there. Almost
> > like we already needed to identify messages or something. At that point we
> > should be good, since we can easily retrieve, update, and put on it.
> >
> > We'll also need to make sure any long-term storage solution also uses it.
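[Editor's note: a minimal sketch of the "diff per transaction" idea Casey raises above, assuming each edit is stored as a set/delete diff against the JSON map and the reader either replays the ordered diffs or reads a pre-cached current value. The function and field names are illustrative, not actual Metron APIs.]

```python
def apply_diff(doc, diff):
    """Apply one transaction (a set/delete diff) to a JSON-style map."""
    out = dict(doc)
    out.update(diff.get("set", {}))
    for field in diff.get("delete", []):
        out.pop(field, None)
    return out

def replay(original, diffs):
    """Reader-side merge: fold the ordered transaction log into current state."""
    doc = original
    for diff in diffs:
        doc = apply_diff(doc, diff)
    return doc

original = {"guid": "abc-123", "ip_src_addr": "10.0.0.1", "is_alert": True}
log = [
    {"set": {"is_alert": False}},                       # analyst dismisses alert
    {"set": {"comment": "false positive"}, "delete": ["is_alert"]},
]

current = replay(original, log)
# Pre-caching `current` at write time is what spares every reader this replay.
```

The replay cost grows with the number of edits per message, which is the burden the pre-cached current value is meant to remove.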
> > On Thu, Jun 22, 2017 at 12:52 PM, [email protected] <[email protected]> wrote:
> >
> >> The key should be a solved problem as of METRON-765
> >> <https://github.com/apache/metron/commit/27b0d6e31de94317b085766349a892395f0d3309>,
> >> right? It provides a single key for a given message that is globally
> >> stored with the message, regardless of where/how.
> >>
> >> Jon
> >>
> >> On Thu, Jun 22, 2017 at 9:01 AM Justin Leet <[email protected]> wrote:
> >>
> >> > First off, I agree with the characteristics.
> >> >
> >> > For the data stores, we'll need to make sure we can actually handle
> >> > the collapsing of the updates into a single view. Casey mentioned
> >> > making the long-term stores transparent, but there's potentially work
> >> > for the near-real-time stores: we need to make sure we actually do
> >> > updates, rather than create new docs that aren't linked to the old
> >> > ones. This should be entirely transparent and handled by a service
> >> > layer, rather than anything hardcoded to a datastore.
> >> >
> >> > For ES at least, the only way to do this is to retrieve the doc,
> >> > mutate it, and then reindex (even the update API does that dance under
> >> > the hood for you, and since we're potentially doing non-trivial
> >> > changes, we might need to manage it ourselves). This implies the
> >> > existence of a key, even if one isn't enforced by ES (which I don't
> >> > believe it will be). We need to be able to grab the doc(s?) to be
> >> > updated, not end up with similar ones that shouldn't be mutated. I
> >> > assume this is also true (at least in its generalities) of Solr as well.
> >> >
> >> > In concert with your other thread, couldn't part of this key end up
> >> > being metadata (either user-defined or environment-defined)?
> >> > For example, in a situation where customer ID is applied as metadata,
> >> > it's possible two customers feed off the same data source, but may
> >> > need to mutate independently. At this point, we have metadata that is
> >> > effectively keyed. We don't want to update both docs, but there's not
> >> > a real way to distinguish them. And maybe that's something we push off
> >> > for the short term, but it seems potentially nontrivial.
> >> >
> >> > In terms of consistency, I'd definitely agree that the long-term
> >> > storage can be eventually consistent. Any type of bulk spelunking,
> >> > Spark jobs, dashboarding, etc. shouldn't need up-to-the-millisecond data.
> >> >
> >> > Basically, I'm thinking the real-time store is the snapshot of current
> >> > state, and the long-term store is the full record, complete with the
> >> > lineage history.
> >> >
> >> > I'm also interested in people's opinions on how we want to manage HDFS.
> >> > Assuming we do use HBase to store our updates, that means that every
> >> > HDFS op has to join onto that HBase table to get any updates that HDFS
> >> > is missing (unless we implement some writeback and merge for HDFS
> >> > data). I'm worried that our two datastores are really ES and
> >> > HDFS+HBase, and that keeping that data actually synced for end users
> >> > is going to be painful.
> >> >
> >> > Justin
> >> >
> >> >
> >> > On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <
> >> > [email protected]> wrote:
> >> >
> >> > > I'd say that was an excellent set of requirements (very similar to
> >> > > the one we arrived at in the last discuss thread on this).
> >> > >
> >> > > My vote remains a transaction log in HBase. Given the relatively low
> >> > > volume (human scale), I would not expect this to need anything fancy
> >> > > like compaction into HDFS state, but that does make a good argument
> >> > > for a long-term DataFrame solution for Spark, with a short-term
> >> > > stopgap using a joined DataFrame and SHC.
> >> > >
> >> > > Simon
> >> > >
> >> > > Sent from my iPhone
> >> > >
> >> > > > On 22 Jun 2017, at 05:11, Otto Fowler <[email protected]> wrote:
> >> > > >
> >> > > > Can you clarify what data stores are at play here?
> >> > > >
> >> > > >
> >> > > > On June 21, 2017 at 17:07:42, Casey Stella ([email protected]) wrote:
> >> > > >
> >> > > > Hi All,
> >> > > >
> >> > > > I know we've had a couple of these already, but we're due for
> >> > > > another discussion of a sensible approach to mutating indexed data.
> >> > > > The motivation for this is that users will want to update fields to
> >> > > > correct and augment data. These corrections are invaluable for
> >> > > > things like feedback for ML models or just plain providing better
> >> > > > context when evaluating alerts, etc.
> >> > > >
> >> > > > Rather than posing a solution, I'd like to pose the characteristics
> >> > > > of a solution, and we can fight about those first.
> >> > > > ;)
> >> > > >
> >> > > > In my mind, the following are the characteristics I'd look for:
> >> > > >
> >> > > > - Changes should be considered additional or replacement fields
> >> > > >   for existing fields
> >> > > > - Changes need to be available in the web view in near real time
> >> > > >   (on the order of milliseconds)
> >> > > > - Changes should be available in the batch view
> >> > > >   - I'd be OK with eventually consistent for the batch view vs.
> >> > > >     the web view; thoughts?
> >> > > > - Changes should have lineage preserved
> >> > > >   - Current value is the optimized path
> >> > > >   - Lineage search is the less optimized path
> >> > > > - If HBase is part of a solution:
> >> > > >   - maintain a scan-free solution
> >> > > >   - maintain a coprocessor-free solution
> >> > > >
> >> > > > Most of what I've thought of is something along these lines:
> >> > > >
> >> > > > - Diffs are stored in columns in HBase row(s)
> >> > > >   - row GUID:current would have one column with the current
> >> > > >     representation
> >> > > >   - row GUID:lineage would have an ordered set of columns
> >> > > >     representing the lineage diffs
> >> > > > - Mutable indices are directly updated (e.g. Solr or ES)
> >> > > > - We'd probably want to provide transparent read support
> >> > > >   downstream which supports merging for batch read:
> >> > > >   - a Spark DataFrame
> >> > > >   - a Hive SerDe
> >> > > >
> >> > > > What I'd like to get out of this discussion is an architecture
> >> > > > document with a suggested approach and the necessary JIRAs to
> >> > > > split this up. If anyone has suggestions or comments about any of
> >> > > > this, please speak up. I'd like to actually get this done in the
> >> > > > near term. :)
> >> > > >
> >> > > > Best,
> >> > > >
> >> > > > Casey
> >> > >
> >>
> >> --
> >>
> >> Jon
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC - Apache Metron (Incubating)
> jsirota AT apache DOT org
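[Editor's note: a minimal in-memory sketch, under stated assumptions, of the scan-free row layout Casey proposes: the current value under one row key (`GUID:current`), the lineage diffs under a sibling row key (`GUID:lineage`) as timestamp-ordered columns, so both reads are point gets rather than scans. The key/column formats and the dict standing in for the HBase table are illustrative only.]

```python
import json

def current_row_key(guid):
    return f"{guid}:current"

def lineage_row_key(guid):
    return f"{guid}:lineage"

def lineage_column(ts):
    # Zero-padded timestamps keep the columns in transaction order
    # under lexicographic column sorting (as HBase sorts qualifiers).
    return f"t{ts:020d}"

# In-memory stand-in for the table: {row_key: {column: value}}.
table = {}

def record_edit(guid, ts, diff, new_current):
    """Write one transaction: append the diff to the lineage row and
    overwrite the pre-cached current representation."""
    table.setdefault(lineage_row_key(guid), {})[lineage_column(ts)] = json.dumps(diff)
    table[current_row_key(guid)] = {"v": json.dumps(new_current)}

record_edit("abc-123", 1498500000000,
            {"set": {"is_alert": False}},
            {"guid": "abc-123", "is_alert": False})

# Current value: a single point get, no merge and no scan.
current = json.loads(table["abc-123:current"]["v"])
# Lineage (the less optimized path): a point get on the sibling row,
# with the columns already sorted into transaction order.
ordered = sorted(table["abc-123:lineage"].keys())
```

Because both row keys derive from the GUID, readers and the downstream merge layers (the Spark DataFrame or Hive SerDe mentioned above) can fetch current state and lineage with two point gets, which is what keeps the design scan-free and coprocessor-free.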
