The key should be a solved problem as of METRON-765 <https://github.com/apache/metron/commit/27b0d6e31de94317b085766349a892395f0d3309>, right? It provides a single key for a given message that is globally stored with the message, regardless of where or how the message is stored.
Jon

On Thu, Jun 22, 2017 at 9:01 AM Justin Leet <[email protected]> wrote:

> First off, I agree with the characteristics.
>
> For the data stores, we'll need to make sure we can actually handle
> collapsing the updates into a single view. Casey mentioned making the
> long-term stores transparent, but there's potentially work for the near
> real-time stores: we need to make sure we actually do updates, rather
> than create new docs that aren't linked to the old ones. This should be
> entirely transparent and handled by a service layer, rather than anything
> hardcoded to a datastore.
>
> For ES at least, the only way to do this is to retrieve the doc, mutate
> it, and then reindex it (even the Update API does that dance under the
> hood for you, and since we're potentially making non-trivial changes we
> might need to manage it ourselves). This implies the existence of a key,
> even if one isn't enforced by ES (which I don't believe it will be). We
> need to be able to grab the doc(s) to be updated, not end up with similar
> ones that shouldn't be mutated. I assume this is also true (at least in
> its generalities) of Solr as well.
>
> In concert with your other thread, couldn't part of this key end up being
> metadata (either user-defined or environment-defined)? For example, in a
> situation where a customer id is applied as metadata, it's possible two
> customers feed off the same data source but need to mutate independently.
> At this point, we have metadata that is effectively keyed. We don't want
> to update both docs, but there's no real way to distinguish them. Maybe
> that's something we push off for the short term, but it seems potentially
> nontrivial.
>
> In terms of consistency, I'd definitely agree that the long-term storage
> can be eventually consistent. Any kind of bulk spelunking, Spark jobs,
> dashboarding, etc. shouldn't need up-to-the-millisecond data.
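The retrieve/mutate/reindex dance Justin describes can be sketched roughly as below. This is an illustrative sketch only, not Metron code: the in-memory `FakeIndex` stands in for an ES/Solr index keyed by the METRON-765 GUID, and all names and fields are hypothetical.

```python
# Illustrative sketch of the retrieve -> mutate -> reindex dance.
# FakeIndex stands in for an ES/Solr index keyed by the METRON-765 GUID;
# every name here is hypothetical, not a Metron or Elasticsearch API.

class FakeIndex:
    def __init__(self):
        self._docs = {}  # guid -> document

    def index(self, guid, doc):
        # Upsert by key: the old doc is replaced, so we never end up
        # with a second, similar doc that shouldn't be mutated.
        self._docs[guid] = doc

    def get(self, guid):
        return self._docs[guid]

def apply_patch(index, guid, patch):
    """Read-modify-write: fetch the doc by its GUID, merge the patched
    fields over the existing ones, and reindex under the same key."""
    current = dict(index.get(guid))
    current.update(patch)  # additional or replacement fields
    index.index(guid, current)
    return current

idx = FakeIndex()
idx.index("guid-1", {"ip_src_addr": "10.0.0.1", "is_alert": True})
apply_patch(idx, "guid-1", {"is_alert": False, "analyst_note": "benign"})
```

Without a stable key, the final `index` call would create a new doc instead of replacing the old one, which is exactly the failure mode described above.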
>
> Basically, I'm thinking the real-time store is the snapshot of current
> state, and the long-term store is the full record, complete with the
> lineage history.
>
> I'm also interested in people's opinions on how we want to manage HDFS.
> Assuming we do use HBase to store our updates, every HDFS op has to join
> onto that HBase table to get any updates that HDFS is missing (unless we
> implement some writeback and merge for the HDFS data). I'm worried that
> our two datastores are really ES and HDFS+HBase, and that keeping that
> data actually synced for end users is going to be painful.
>
> Justin
>
> On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <
> [email protected]> wrote:
>
> > I'd say that was an excellent set of requirements (very similar to the
> > ones we arrived at in the last discuss thread on this).
> >
> > My vote remains a transaction log in HBase. Given the relatively low
> > volume (human scale), I would not expect this to need anything fancy
> > like compaction into HDFS state, but that does make a good argument for
> > a long-term dataframe solution for Spark, with a short-term stopgap
> > using a joined data frame and SHC.
> >
> > Simon
> >
> > > On 22 Jun 2017, at 05:11, Otto Fowler <[email protected]> wrote:
> > >
> > > Can you clarify which data stores are at play here?
> > >
> > > On June 21, 2017 at 17:07:42, Casey Stella ([email protected]) wrote:
> > >
> > > Hi All,
> > >
> > > I know we've had a couple of these already, but we're due for another
> > > discussion of a sensible approach to mutating indexed data. The
> > > motivation for this is that users will want to update fields to
> > > correct and augment data. These corrections are invaluable for things
> > > like feedback for ML models, or just plain providing better context
> > > when evaluating alerts, etc.
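The merge-on-read that worries Justin can be sketched as follows: every batch read of the HDFS data is joined against the HBase update table so consumers see patched values. Lists and dicts stand in for HDFS files and HBase rows here; the field names are hypothetical.

```python
# Sketch of merge-on-read over HDFS+HBase: each raw record from the
# long-term store is patched with any update recorded in HBase. Plain
# Python structures stand in for the real stores; names are hypothetical.

hdfs_records = [
    {"guid": "g1", "score": 10},
    {"guid": "g2", "score": 20},
]
hbase_updates = {"g1": {"score": 99}}  # guid -> latest collapsed patch

def merged_view(records, updates):
    """Yield each HDFS record with its HBase patch (if any) applied."""
    for rec in records:
        patched = dict(rec)
        patched.update(updates.get(rec["guid"], {}))
        yield patched

result = list(merged_view(hdfs_records, hbase_updates))
```

Every batch consumer (Spark job, Hive query) has to perform this join, which is why the thread treats the two datastores as effectively "ES" and "HDFS+HBase".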
> > >
> > > Rather than posing a solution, I'd like to pose the characteristics
> > > of a solution, and we can fight about those first. ;)
> > >
> > > In my mind, the following are the characteristics that I'd look for:
> > >
> > > - Changes should be considered additional or replacement fields for
> > >   existing fields
> > > - Changes need to be available in the web view in near real time (on
> > >   the order of milliseconds)
> > > - Changes should be available in the batch view
> > >   - I'd be ok with that being eventually consistent with the web
> > >     view; thoughts?
> > > - Changes should have their lineage preserved
> > >   - The current value is the optimized path
> > >   - Lineage search is the less optimized path
> > > - If HBase is part of a solution:
> > >   - maintain a scan-free solution
> > >   - maintain a coprocessor-free solution
> > >
> > > Most of what I've thought of is something along these lines:
> > >
> > > - Diffs are stored in columns in HBase row(s)
> > >   - row GUID:current would have one column with the current
> > >     representation
> > >   - row GUID:lineage would have an ordered set of columns
> > >     representing the lineage diffs
> > > - The mutable index (e.g. Solr or ES) is updated directly
> > > - We'd probably want to provide transparent read support downstream
> > >   that supports merging for batch reads:
> > >   - a Spark dataframe
> > >   - a Hive SerDe
> > >
> > > What I'd like to get out of this discussion is an architecture
> > > document with a suggested approach and the necessary JIRAs to split
> > > this up. If anyone has suggestions or comments about any of this,
> > > please speak up. I'd like to actually get this done in the near
> > > term. :)
> > >
> > > Best,
> > >
> > > Casey

--
Jon
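Casey's proposed row layout can be sketched as below: one row holds the precomputed current representation (the optimized path), and one row holds an ordered set of lineage diffs that can be replayed to reconstruct any historical state (the less optimized path). Plain dicts stand in for the HBase rows and columns; all names are hypothetical, not a proposed schema.

```python
# Sketch of the proposed HBase layout for a single message GUID.
# Dicts/lists stand in for HBase rows and ordered columns; all names
# are hypothetical illustrations, not Metron code.

# The original indexed document, before any corrections.
base = {"ip_src_addr": "10.0.0.1", "is_alert": True}

# row GUID:lineage -- an ordered set of diffs, one per correction.
lineage = [
    {"is_alert": False},
    {"analyst_note": "false positive"},
]

def replay(base_doc, diffs):
    """Lineage path: rebuild state by replaying diffs in order.
    Replaying a prefix of the diffs yields any historical state."""
    doc = dict(base_doc)
    for diff in diffs:
        doc.update(diff)
    return doc

# row GUID:current -- the optimized path: a single precomputed column
# holding the result of the full replay, read without any scan.
current = replay(base, lineage)
```

Because `current` is precomputed on write, the common read path is a single keyed `Get` (scan-free and coprocessor-free), while lineage queries pay the replay cost only when asked.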
