Thanks, Jon, that looks like it should work for the key. I didn't realize that the GUID was handled that way, which makes life much easier there. Almost like we already needed to identify messages or something. At that point we should be good, since we can easily retrieve, update, and put it back.
We'll also need to make sure any long-term storage solution also uses it.

On Thu, Jun 22, 2017 at 12:52 PM, [email protected] <[email protected]> wrote:

> The key should be a solved problem as of METRON-765
> <https://github.com/apache/metron/commit/27b0d6e31de94317b085766349a892395f0d3309>,
> right? It provides a single key for a given message that is globally stored
> with the message, regardless of where/how.
>
> Jon
>
> On Thu, Jun 22, 2017 at 9:01 AM Justin Leet <[email protected]> wrote:
>
> > First off, I agree with the characteristics.
> >
> > For the data stores, we'll need to make sure we can actually handle the
> > collapsing of the updates into a single view. Casey mentioned making the
> > long-term stores transparent, but there's potentially work for the near
> > real-time stores: we need to make sure we actually do updates, rather
> > than create new docs that aren't linked to the old ones. This should be
> > entirely transparent and handled by a service layer, rather than
> > anything hardcoded to a datastore.
> >
> > For ES at least, the only way to do this is to retrieve the doc, mutate
> > it, and then reindex it (even the update API does that dance under the
> > hood for you, and since we're potentially doing non-trivial changes we
> > might need to manage it ourselves). This implies the existence of a key,
> > even if one isn't enforced by ES (which I don't believe it will be). We
> > need to be able to grab the doc(s?) to be updated, not end up with
> > similar ones that shouldn't be mutated. I assume this is also true (at
> > least in its generalities) of Solr as well.
> >
> > In concert with your other thread, couldn't part of this key end up
> > being metadata (either user-defined or environment-defined)? For
> > example, in a situation where customer id is applied as metadata, it's
> > possible two customers feed off the same datasource, but may need to
> > mutate independently. At this point, we have metadata that is
> > effectively keyed.
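The "retrieve, mutate, reindex" dance Justin describes can be sketched as follows. This is an illustrative sketch only: a plain dict stands in for the ES index (in practice these would be get/index calls against the same document id), and the GUID and field names are hypothetical.

```python
# Sketch of the retrieve/mutate/reindex dance. A dict stands in for the
# index; the key point is that the write goes back under the SAME id, so
# we update the doc rather than create an unlinked sibling.

index = {
    "guid-123": {"source.type": "bro", "ip_src_addr": "10.0.0.1", "is_alert": True},
}

def update_document(index, guid, patch):
    """Fetch the doc by its key, merge additional/replacement fields,
    and write it back under the same key."""
    doc = dict(index[guid])   # retrieve
    doc.update(patch)         # mutate (additional or replacement fields)
    index[guid] = doc         # reindex under the same GUID
    return doc

updated = update_document(
    index, "guid-123",
    {"ip_src_addr": "10.0.0.2", "analyst_note": "corrected"},
)
```

Without a stable key, the last step would have to be an insert of a new, similar doc — exactly the situation the thread wants to avoid.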
> > We don't want to update both docs, but there's not a real way to
> > distinguish them. And maybe that's something we push off for the short
> > term, but it seems potentially nontrivial.
> >
> > In terms of consistency, I'd definitely agree that the long-term storage
> > can be eventually consistent. Any type of bulk spelunking, Spark jobs,
> > dashboarding, etc. shouldn't need up-to-the-millisecond data.
> >
> > Basically, I'm thinking the real-time store is the snapshot of current
> > state, and the long-term store is the full record, complete with the
> > lineage history.
> >
> > I'm also interested in people's opinions on how we want to manage HDFS.
> > Assuming we do use HBase to store our updates, that means that every
> > HDFS op has to join onto that HBase table to get any updates that HDFS
> > is missing (unless we implement some writeback and merge for HDFS data).
> > I'm worried that our two datastores are really ES and HDFS+HBase, and
> > that keeping that data actually synced for end users is going to be
> > painful.
> >
> > Justin
> >
> > On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <
> > [email protected]> wrote:
> >
> > > I'd say that was an excellent set of requirements (very similar to the
> > > one we arrived on with the last discuss thread on this).
> > >
> > > My vote remains a transaction log in HBase. Given the relatively low
> > > volume (human scale), I would not expect this to need anything fancy
> > > like compaction into HDFS state, but that does make a good argument
> > > for a long-term dataframe solution for Spark, with a short-term
> > > stopgap using a joined data frame and shc.
> > >
> > > Simon
> > >
> > > Sent from my iPhone
> > >
> > > > On 22 Jun 2017, at 05:11, Otto Fowler <[email protected]> wrote:
> > > >
> > > > Can you clarify what data stores are at play here?
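Justin's HDFS concern — that every batch read has to be joined against the HBase update table — amounts to the overlay merge sketched below. Plain lists and dicts stand in for the immutable HDFS records and the HBase updates; all guids and field names are hypothetical.

```python
# Sketch of the batch-read merge: the HDFS data is immutable, so any
# corrections captured in HBase after the records landed must be joined
# in at read time, keyed by GUID.

hdfs_records = [
    {"guid": "a", "score": 10},
    {"guid": "b", "score": 20},
]

# Updates captured in the HBase table after the records landed in HDFS.
hbase_updates = {
    "b": {"score": 99, "analyst_note": "rescored"},
}

def merged_batch_view(records, updates):
    """Overlay per-GUID updates onto the immutable batch records."""
    return [{**rec, **updates.get(rec["guid"], {})} for rec in records]

view = merged_batch_view(hdfs_records, hbase_updates)
```

A Spark dataframe or Hive SerDe doing this transparently is what makes "ES vs. HDFS+HBase" look like two datastores to the rest of the system instead of three.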
> > > >
> > > > On June 21, 2017 at 17:07:42, Casey Stella ([email protected]) wrote:
> > > >
> > > > Hi All,
> > > >
> > > > I know we've had a couple of these already, but we're due for another
> > > > discussion of a sensible approach to mutating indexed data. The
> > > > motivation for this is that users will want to update fields to
> > > > correct and augment data. These corrections are invaluable for things
> > > > like feedback for ML models or just plain providing better context
> > > > when evaluating alerts, etc.
> > > >
> > > > Rather than posing a solution, I'd like to pose the characteristics
> > > > of a solution and we can fight about those first. ;)
> > > >
> > > > In my mind, the following are the characteristics that I'd look for:
> > > >
> > > > - Changes should be considered additional or replacement fields for
> > > >   existing fields
> > > > - Changes need to be available in the web view in near real time (on
> > > >   the order of milliseconds)
> > > > - Changes should be available in the batch view
> > > >   - I'd be ok with eventually consistent for the batch view,
> > > >     thoughts?
> > > > - Changes should have lineage preserved
> > > >   - Current value is the optimized path
> > > >   - Lineage search is the less optimized path
> > > > - If HBase is part of a solution:
> > > >   - maintain a scan-free solution
> > > >   - maintain a coprocessor-free solution
> > > >
> > > > Most of what I've thought of is something along these lines:
> > > >
> > > > - Diffs are stored in columns in HBase row(s)
> > > >   - row GUID:current would have one column with the current
> > > >     representation
> > > >   - row GUID:lineage would have an ordered set of columns
> > > >     representing the lineage diffs
> > > > - Mutable indices are directly updated (e.g. Solr or ES)
> > > > - We'd probably want to provide transparent read support downstream
> > > >   which supports merging for batch read:
> > > >   - a Spark dataframe
> > > >   - a Hive SerDe
> > > >
> > > > What I'd like to get out of this discussion is an architecture
> > > > document with a suggested approach and the necessary JIRAs to split
> > > > this up. If anyone has suggestions or comments about any of this,
> > > > please speak up. I'd like to actually get this done in the near term.
> > > > :)
> > > >
> > > > Best,
> > > >
> > > > Casey
>
> --
> Jon
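Casey's proposed GUID:current / GUID:lineage row layout can be sketched as below. A dict of row-key to columns stands in for the HBase table, and the GUID, timestamps, and diff contents are hypothetical; the point is that the current value stays a single point lookup (scan-free), while the ordered diffs preserve lineage and can be replayed to reproduce it.

```python
# Sketch of the proposed row layout: one row per message holding the
# current representation, and a sibling row holding the ordered lineage
# diffs. A dict stands in for the HBase table.

table = {}

def apply_update(table, guid, diff, ts):
    """Merge a diff into the current row and append it to the lineage row."""
    current = dict(table.get(guid + ":current", {}))
    current.update(diff)
    table[guid + ":current"] = current      # current value: point lookup, scan-free
    lineage = table.setdefault(guid + ":lineage", {})
    lineage[ts] = diff                      # ordered set of diff columns

apply_update(table, "guid-123", {"ip_src_addr": "10.0.0.1"}, ts=1)
apply_update(table, "guid-123", {"ip_src_addr": "10.0.0.2", "note": "fixed"}, ts=2)

# Replaying the lineage diffs in order reproduces the current view,
# which is what a merging batch reader (dataframe / SerDe) would do.
replayed = {}
for ts in sorted(table["guid-123:lineage"]):
    replayed.update(table["guid-123:lineage"][ts])
```

Keeping both rows under the same GUID prefix means neither the optimized current-value path nor the lineage path needs a scan or a coprocessor, matching the stated requirements.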
