First off, I agree with the characteristics. For the data stores, we need to make sure we can actually handle collapsing the updates into a single view. Casey mentioned making the long-term stores transparent, but there's potentially work for the near-real-time stores: we need to make sure we actually perform updates, rather than create new docs that aren't linked to the old ones. This should be entirely transparent and handled by a service layer, rather than anything hardcoded to a particular datastore.
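To make that concrete, here's a minimal sketch of what I mean by a service layer. All of the names here are hypothetical (nothing like this exists yet); the point is just that callers talk to one service, and each store implements its own update mechanics behind a common interface:

```python
from abc import ABC, abstractmethod


class IndexDao(ABC):
    """Hypothetical datastore-agnostic interface; concrete DAOs
    (ES, Solr, HBase, ...) would each implement update() their own way."""

    @abstractmethod
    def update(self, guid: str, patch: dict) -> None:
        ...


class UpdateService:
    """Fans a single logical update out to every registered store,
    so nothing upstream is hardcoded to a particular datastore."""

    def __init__(self, daos: list):
        self.daos = daos

    def apply_update(self, guid: str, patch: dict) -> None:
        for dao in self.daos:
            dao.update(guid, patch)


class InMemoryDao(IndexDao):
    """Toy DAO standing in for a real store so the sketch is runnable."""

    def __init__(self):
        self.docs = {}

    def update(self, guid: str, patch: dict) -> None:
        # An update mutates the existing doc for this key; it never
        # creates a second, unlinked doc.
        self.docs.setdefault(guid, {}).update(patch)
```

Usage would be something like registering a real-time DAO and a long-term DAO with one `UpdateService` and calling `apply_update` once per correction.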
For ES at least, the only way to do this is to retrieve the doc, mutate it, and reindex it (even the update API does that dance under the hood for you, and since we're potentially doing non-trivial changes we might need to manage it ourselves). This implies the existence of a key, even if one isn't enforced by ES (which I don't believe it will be). We need to be able to grab exactly the doc(s?) to be updated, not end up mutating similar ones that shouldn't be touched. I assume this is also broadly true of Solr.

In concert with your other thread, couldn't part of this key end up being metadata (either user defined or environment defined)? For example, in a situation where customer id is applied as metadata, it's possible two customers feed off the same datasource but need to mutate independently. At that point, we have metadata that is effectively part of the key: we don't want to update both docs, but there's no real way to distinguish them. Maybe that's something we push off in the short term, but it seems potentially nontrivial.

In terms of consistency, I'd definitely agree that the long-term storage can be eventually consistent. Any type of bulk spelunking, Spark jobs, dashboarding, etc. shouldn't need up-to-the-millisecond data. Basically, I'm thinking the real-time store is the snapshot of current state, and the long-term store is the full record, complete with the lineage history.

I'm also interested in people's opinions on how we want to manage HDFS. Assuming we do use HBase to store our updates, that means every HDFS op has to join against that HBase table to pick up any updates HDFS is missing (unless we implement some writeback and merge for HDFS data). I'm worried that our two datastores are really ES and HDFS+HBase, and that keeping that data actually synced for end users is going to be painful.
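The retrieve-mutate-reindex dance is simple enough to sketch. Here the `fetch` and `index` callables stand in for the real client calls (e.g. `es.get` / `es.index` in the Python ES client) so the sketch stays runnable; the key point is that a missing key is an error, because silently indexing a fresh doc is exactly the "new doc not linked to the old one" failure mode:

```python
def update_document(fetch, index, guid, patch):
    """Retrieve-mutate-reindex: fetch the existing doc by its key,
    apply the (possibly non-trivial) mutation, and reindex the whole
    doc. `fetch(guid)` returns the doc or None; `index(guid, doc)`
    writes it back. Both are injected stand-ins for the ES/Solr client."""
    doc = fetch(guid)
    if doc is None:
        # Refuse to create a new, unlinked doc on a failed lookup.
        raise KeyError(f"no existing document for key {guid!r}")
    doc.update(patch)  # additional or replacement fields only
    index(guid, doc)
    return doc
```

With a real client you'd also want optimistic concurrency (ES exposes sequence numbers / versions for this) so two concurrent edits can't clobber each other, but that's beyond this sketch.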
Justin

On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <[email protected]> wrote:

> I'd say that was an excellent set of requirements (very similar to the one
> we arrived on with the last discuss thread on this).
>
> My vote remains a transaction log in HBase. Given the relatively low volume
> (human scale) I would not expect this to need anything fancy like
> compaction into HDFS state, but that does make a good argument for a
> long-term dataframe solution for Spark, with a short-term stopgap using a
> joined data frame and SHC.
>
> Simon
>
> Sent from my iPhone
>
> > On 22 Jun 2017, at 05:11, Otto Fowler <[email protected]> wrote:
> >
> > Can you clarify what data stores are at play here?
> >
> >
> > On June 21, 2017 at 17:07:42, Casey Stella ([email protected]) wrote:
> >
> > Hi All,
> >
> > I know we've had a couple of these already, but we're due for another
> > discussion of a sensible approach to mutating indexed data. The
> > motivation for this is that users will want to update fields to correct
> > and augment data. These corrections are invaluable for things like
> > feedback for ML models or just plain providing better context when
> > evaluating alerts, etc.
> >
> > Rather than posing a solution, I'd like to pose the characteristics of a
> > solution and we can fight about those first. ;)
> >
> > In my mind, the following are the characteristics that I'd look for:
> >
> > - Changes should be considered additional or replacement fields for
> >   existing fields
> > - Changes need to be available in the web view in near real time (on the
> >   order of milliseconds)
> > - Changes should be available in the batch view
> >   - I'd be ok with eventually consistent for the batch view, thoughts?
> > - Changes should have lineage preserved
> >   - Current value is the optimized path
> >   - Lineage search is the less optimized path
> > - If HBase is part of a solution
> >   - maintain a scan-free solution
> >   - maintain a coprocessor-free solution
> >
> > Most of what I've thought of is something along these lines:
> >
> > - Diffs are stored in columns in HBase row(s)
> >   - row GUID:current would have one column with the current
> >     representation
> >   - row GUID:lineage would have an ordered set of columns representing
> >     the lineage diffs
> > - Mutable indices are directly updated (e.g. Solr or ES)
> > - We'd probably want to provide transparent read support downstream
> >   which supports merging for batch read:
> >   - a Spark dataframe
> >   - a Hive SerDe
> >
> > What I'd like to get out of this discussion is an architecture document
> > with a suggested approach and the necessary JIRAs to split this up. If
> > anyone has suggestions or comments about any of this, please speak up.
> > I'd like to actually get this done in the near term. :)
> >
> > Best,
> >
> > Casey
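For reference, Casey's GUID:current / GUID:lineage row layout can be sketched concretely. The row-key scheme is from his mail; everything else (the in-memory dict standing in for HBase rows, the timestamp-keyed lineage columns) is my assumption about how it might work:

```python
import json


class LineageStore:
    """Toy sketch of the proposed HBase layout: one row per GUID for the
    current merged doc, one row per GUID for the ordered lineage diffs.
    A dict of row-key -> {column: value} stands in for HBase rows so the
    sketch is runnable; a real version would use HBase puts/gets."""

    def __init__(self):
        self.rows = {}

    def apply_patch(self, guid, patch, ts):
        # GUID:current holds one column with the current representation.
        current = self.rows.setdefault(f"{guid}:current", {"doc": {}})
        current["doc"].update(patch)
        # GUID:lineage holds an ordered set of columns, one diff each,
        # keyed by timestamp so the history replays in order.
        lineage = self.rows.setdefault(f"{guid}:lineage", {})
        lineage[ts] = json.dumps(patch)

    def current(self, guid):
        # Optimized path: a single point get, no scan, no coprocessor.
        return self.rows.get(f"{guid}:current", {"doc": {}})["doc"]

    def lineage(self, guid):
        # Less optimized path: read and replay the ordered diffs.
        row = self.rows.get(f"{guid}:lineage", {})
        return [json.loads(row[ts]) for ts in sorted(row)]
```

Both reads are point lookups by row key, which is what keeps this scan-free and coprocessor-free as the requirements ask.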
