First off, I agree with the characteristics.

For the data stores, we need to make sure we can actually collapse the
updates into a single view.  Casey mentioned making the long-term stores
transparent, but there's potentially work for the near-real-time stores: we
need to make sure we actually update documents, rather than create new docs
that aren't linked to the old ones.  This should be entirely transparent and
handled by a service layer, rather than anything hardcoded to a particular
datastore.

For ES at least, the only way to do this is to retrieve the document, mutate
it, and then reindex it (even the Update API does that dance under the hood
for you, and since we're potentially making non-trivial changes we may need
to manage it ourselves).  This implies the existence of a key, even if one
isn't enforced by ES (which I don't believe it will be).  We need to be able
to grab exactly the doc(s) to be updated, not end up with similar ones that
shouldn't be mutated.  I assume the same is broadly true of Solr as well.
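To make that dance concrete, here's a minimal sketch of the
retrieve-mutate-reindex cycle, with a plain dict standing in for the ES/Solr
index; the key field name ("guid") and the function name are assumptions,
not anything we've built:

```python
# Sketch of the read-modify-write cycle against a keyed index.
# A dict stands in for ES/Solr; "guid" is an assumed key field.
def apply_update(index, guid, patch):
    doc = index.get(guid)
    if doc is None:
        raise KeyError(f"no document with guid {guid}")
    # Mutate: patched fields replace or augment the existing fields.
    updated = {**doc, **patch}
    # Reindex: write the merged doc back under the same key, so we
    # update in place rather than creating an unlinked sibling doc.
    index[guid] = updated
    return updated

index = {"abc-123": {"guid": "abc-123", "ip_src_addr": "10.0.0.1",
                     "is_alert": False}}
apply_update(index, "abc-123", {"is_alert": True})
```

The important property is the last line of the function: without a key to
write back under, the "reindex" step silently becomes "create a new doc".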

In concert with your other thread, couldn't part of this key end up being
metadata (either user-defined or environment-defined)?  For example, in a
situation where a customer ID is applied as metadata, two customers might
feed off the same datasource but need to mutate their copies independently.
At that point, the metadata is effectively part of the key: we don't want to
update both docs, but there's no real way to distinguish them.  Maybe that's
something we push off in the short term, but it seems potentially
nontrivial.
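A quick sketch of what I mean by "metadata that is effectively keyed": the
effective key for a mutation becomes (metadata, source doc id), not the doc
id alone.  All field and function names here are hypothetical:

```python
# Sketch: when customer id is applied as metadata, two customers fed
# from the same datasource produce docs with the same source id, and
# only the metadata distinguishes them. Names are hypothetical.
def effective_key(doc, metadata_fields=("customer_id",)):
    parts = [str(doc.get(f, "")) for f in metadata_fields]
    parts.append(doc["guid"])
    return ":".join(parts)

doc_a = {"guid": "evt-1", "customer_id": "cust-a"}
doc_b = {"guid": "evt-1", "customer_id": "cust-b"}
# Same source event, but the two copies key (and mutate) independently.
assert effective_key(doc_a) != effective_key(doc_b)
```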

In terms of consistency, I'd definitely agree that the long-term storage can
be eventually consistent.  Any kind of bulk spelunking, Spark jobs,
dashboarding, etc. shouldn't need up-to-the-millisecond data.

Basically, I'm thinking the near-real-time store is the snapshot of current
state, and the long-term store is the full record, complete with the lineage
history.
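The split above can be sketched as a single store keeping both views: the
current snapshot (what the real-time store would hold) plus an ordered list
of diffs that can be replayed for lineage.  This is an illustration under
those assumptions, not a proposed implementation, and all names are made up:

```python
# Sketch of the two views: current snapshot vs. base + ordered diffs.
class LineageStore:
    def __init__(self, base):
        self.base = dict(base)
        self.diffs = []            # ordered lineage, oldest first
        self.current = dict(base)  # what the real-time store would hold

    def update(self, patch):
        self.diffs.append(dict(patch))
        self.current.update(patch)

    def replay(self, n=None):
        """Reconstruct state after the first n diffs (all by default)."""
        doc = dict(self.base)
        for d in self.diffs[:n]:
            doc.update(d)
        return doc

store = LineageStore({"guid": "abc", "score": 10})
store.update({"score": 20})
store.update({"analyst_note": "false positive"})
```

Reading `current` is the optimized path; `replay` is the less-optimized
lineage path, which matches the characteristics Casey listed.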

I'm also interested in people's opinions on how we want to manage HDFS.
Assuming we use HBase to store our updates, every HDFS read has to join onto
that HBase table to pick up any updates that HDFS is missing (unless we
implement some writeback and merge for the HDFS data).  I'm worried that our
two datastores are really ES and HDFS+HBase, and that keeping that data
actually in sync for end users is going to be painful.
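The read-side join I'm worried about looks something like this sketch, with
plain dicts standing in for both the HDFS records and the HBase patch table
(names hypothetical); every batch read pays this merge cost:

```python
# Sketch: merging HBase-stored patches onto immutable HDFS records at
# read time. Dicts stand in for both stores; names are hypothetical.
def read_with_patches(hdfs_records, hbase_patches):
    for rec in hdfs_records:
        patch = hbase_patches.get(rec["guid"])
        yield {**rec, **patch} if patch else rec

hdfs_records = [
    {"guid": "a", "score": 1},
    {"guid": "b", "score": 2},
]
hbase_patches = {"b": {"score": 99}}
merged = list(read_with_patches(hdfs_records, hbase_patches))
```

This is essentially what a merging Spark dataframe or Hive SerDe would have
to do transparently on every read; unpatched records pass through untouched.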

Justin


On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <
[email protected]> wrote:

> I'd say that was an excellent set of requirements (very similar to the one
> we arrived on with the last discuss thread on this).
>
> My vote remains a transaction log in HBase. Given the relatively low volume
> (human scale), I would not expect this to need anything fancy like
> compaction into HDFS state, but that does make a good argument for a long
> term dataframe solution for Spark, with a short term stop gap using a
> joined data frame and SHC.
>
> Simon
>
> Sent from my iPhone
>
> > On 22 Jun 2017, at 05:11, Otto Fowler <[email protected]> wrote:
> >
> > Can you clarify what data stores are at play here?
> >
> >
> > On June 21, 2017 at 17:07:42, Casey Stella ([email protected]) wrote:
> >
> > Hi All,
> >
> > I know we've had a couple of these already, but we're due for another
> > discussion of a sensible approach to mutating indexed data. The
> motivation
> > for this is users will want to update fields to correct and augment data.
> > These corrections are invaluable for things like feedback for ML models
> or
> > just plain providing better context when evaluating alerts, etc.
> >
> > Rather than posing a solution, I'd like to pose the characteristics of a
> > solution and we can fight about those first. ;)
> >
> > In my mind, the following are the characteristics that I'd look for:
> >
> > - Changes should be considered additional or replacement fields for
> > existing fields
> > - Changes need to be available in the web view in near real time (on the
> > order of milliseconds)
> > - Changes should be available in the batch view
> > - I'd be ok with eventually consistent with the web view, thoughts?
> > - Changes should have lineage preserved
> > - Current value is the optimized path
> > - Lineage search is the less optimized path
> > - If HBase is part of a solution
> > - maintain a scan-free solution
> > - maintain a coprocessor-free solution
> >
> > Most of what I've thought of is something along the lines:
> >
> > - Diffs are stored in columns in a HBase row(s)
> > - row: GUID:current would have one column with the current
> > representation
> > - row: GUID:lineage would have an ordered set of columns representing
> > the lineage diffs
> > - Mutable indices are directly updated (e.g. Solr or ES)
> > - We'd probably want to provide transparent read support downstream
> > which supports merging for batch read:
> > - a spark dataframe
> > - a hive serde
> >
> > What I'd like to get out of this discussion is an architecture document
> > with a suggested approach and the necessary JIRAs to split this up. If
> > anyone has suggestions or comments about any of this, please speak up.
> I'd
> > like to actually get this done in the near-term. :)
> >
> > Best,
> >
> > Casey
>
