This problem is not uncommon, I would think. This should be implemented as cleanly as possible so that it can be spun out. It would also be a candidate for a feature/collaboration/long-lived branch.
On June 26, 2017 at 12:44:44, Casey Stella ([email protected]) wrote:

When we're talking about a "transaction log", an edit could involve multiple deletes/additions, so are we proposing storing a diff to the JSON map as the representation of a particular transaction? I proposed pre-caching the current value to lessen the burden on the reader (i.e., not having to merge the transactions into the current state); what do we think of that? Also, I want to ensure we maintain a solution that is scan-free: the edits should exist as separate columns rather than separate rows in the NoSQL store. Thoughts?

On Mon, Jun 26, 2017 at 5:36 PM, James Sirota <[email protected]> wrote:
> It is clear to me that we need an independently stored transaction log
> that is decoupled from any of our existing systems. So Simon's idea of
> storing the transaction logs in HBase and being able to reference them via
> a global ID resonates with me. I like it for the following reasons:
>
> - It makes Metron more pluggable as far as adding additional sources for
>   data storage (for example, a graph database), as well as disabling
>   existing data sources.
>
> - It makes enforcing consistency of data between data sources easier.
>   Each data storage system can be pointed at the transaction log, so
>   when a user modifies data in system X and it gets recorded in the
>   transaction log, systems Y and Z can listen for this change and adjust
>   their data accordingly based on the global ID.
>
> Thanks, James
>
> 22.06.2017, 14:09, "Justin Leet" <[email protected]>:
> > Thanks, Jon, that looks like it should work for the key. I didn't realize
> > that guid got handled that way, which makes life much easier there. Almost
> > like we already needed to identify messages or something. At that point we
> > should be good, since we can easily retrieve, update, and put on it.
> >
> > We'll also need to make sure any long-term storage solution also uses it.
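[Editor's note: a minimal sketch of the "diff per transaction" idea Casey raises above, assuming each edit is stored as a set/delete diff against the JSON map and the reader either replays the ordered diffs or reads a pre-cached current value. The function and field names are illustrative, not actual Metron APIs.]

```python
def apply_diff(doc, diff):
    """Apply one transaction (a set/delete diff) to a JSON-style map."""
    out = dict(doc)
    out.update(diff.get("set", {}))
    for field in diff.get("delete", []):
        out.pop(field, None)
    return out

def replay(original, diffs):
    """Reader-side merge: fold the ordered transaction log into current state."""
    doc = original
    for diff in diffs:
        doc = apply_diff(doc, diff)
    return doc

original = {"guid": "abc-123", "ip_src_addr": "10.0.0.1", "is_alert": True}
log = [
    {"set": {"is_alert": False}},                       # analyst dismisses alert
    {"set": {"comment": "false positive"}, "delete": ["is_alert"]},
]

current = replay(original, log)
# Pre-caching `current` at write time is what spares every reader this replay.
```

The replay cost grows with the number of edits per message, which is the burden the pre-cached current value is meant to remove.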
> > On Thu, Jun 22, 2017 at 12:52 PM, [email protected] <[email protected]> wrote:
> >
> >> The key should be a solved problem as of METRON-765
> >> <https://github.com/apache/metron/commit/27b0d6e31de94317b085766349a892395f0d3309>,
> >> right? It provides a single key for a given message that is globally
> >> stored with the message, regardless of where/how.
> >>
> >> Jon
> >>
> >> On Thu, Jun 22, 2017 at 9:01 AM Justin Leet <[email protected]> wrote:
> >>
> >> > First off, I agree with the characteristics.
> >> >
> >> > For the data stores, we'll need to make sure we can actually handle
> >> > the collapsing of the updates into a single view. Casey mentioned
> >> > making the long-term stores transparent, but there's potentially work
> >> > for the near-real-time stores: we need to make sure we actually do
> >> > updates, rather than create new docs that aren't linked to the old
> >> > ones. This should be entirely transparent and handled by a service
> >> > layer, rather than anything hardcoded to a datastore.
> >> >
> >> > For ES at least, the only way to do this is to retrieve the doc,
> >> > mutate it, and then reindex (even the update API does that dance under
> >> > the hood for you, and since we're potentially doing non-trivial
> >> > changes, we might need to manage it ourselves). This implies the
> >> > existence of a key, even if one isn't enforced by ES (which I don't
> >> > believe it will be). We need to be able to grab the doc(s?) to be
> >> > updated, not end up with similar ones that shouldn't be mutated. I
> >> > assume this is also true (at least in its generalities) of Solr as well.
> >> >
> >> > In concert with your other thread, couldn't part of this key end up
> >> > being metadata (either user-defined or environment-defined)?
> >> > For example, in a situation where customer ID is applied as metadata,
> >> > it's possible two customers feed off the same data source, but may
> >> > need to mutate independently. At this point, we have metadata that is
> >> > effectively keyed. We don't want to update both docs, but there's not
> >> > a real way to distinguish them. And maybe that's something we push off
> >> > for the short term, but it seems potentially nontrivial.
> >> >
> >> > In terms of consistency, I'd definitely agree that the long-term
> >> > storage can be eventually consistent. Any type of bulk spelunking,
> >> > Spark jobs, dashboarding, etc. shouldn't need up-to-the-millisecond data.
> >> >
> >> > Basically, I'm thinking the real-time store is the snapshot of current
> >> > state, and the long-term store is the full record, complete with the
> >> > lineage history.
> >> >
> >> > I'm also interested in people's opinions on how we want to manage HDFS.
> >> > Assuming we do use HBase to store our updates, that means that every
> >> > HDFS op has to join onto that HBase table to get any updates that HDFS
> >> > is missing (unless we implement some writeback and merge for HDFS
> >> > data). I'm worried that our two datastores are really ES and
> >> > HDFS+HBase, and that keeping that data actually synced for end users
> >> > is going to be painful.
> >> >
> >> > Justin
> >> >
> >> >
> >> > On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <
> >> > [email protected]> wrote:
> >> >
> >> > > I'd say that was an excellent set of requirements (very similar to
> >> > > the one we arrived at in the last discuss thread on this).
> >> > >
> >> > > My vote remains a transaction log in HBase. Given the relatively low
> >> > > volume (human scale), I would not expect this to need anything fancy
> >> > > like compaction into HDFS state, but that does make a good argument
> >> > > for a long-term DataFrame solution for Spark, with a short-term
> >> > > stopgap using a joined DataFrame and SHC.
> >> > >
> >> > > Simon
> >> > >
> >> > > Sent from my iPhone
> >> > >
> >> > > > On 22 Jun 2017, at 05:11, Otto Fowler <[email protected]> wrote:
> >> > > >
> >> > > > Can you clarify what data stores are at play here?
> >> > > >
> >> > > >
> >> > > > On June 21, 2017 at 17:07:42, Casey Stella ([email protected]) wrote:
> >> > > >
> >> > > > Hi All,
> >> > > >
> >> > > > I know we've had a couple of these already, but we're due for
> >> > > > another discussion of a sensible approach to mutating indexed data.
> >> > > > The motivation for this is that users will want to update fields to
> >> > > > correct and augment data. These corrections are invaluable for
> >> > > > things like feedback for ML models or just plain providing better
> >> > > > context when evaluating alerts, etc.
> >> > > >
> >> > > > Rather than posing a solution, I'd like to pose the characteristics
> >> > > > of a solution, and we can fight about those first.
> >> > > > ;)
> >> > > >
> >> > > > In my mind, the following are the characteristics I'd look for:
> >> > > >
> >> > > > - Changes should be considered additional or replacement fields
> >> > > >   for existing fields
> >> > > > - Changes need to be available in the web view in near real time
> >> > > >   (on the order of milliseconds)
> >> > > > - Changes should be available in the batch view
> >> > > >   - I'd be OK with eventually consistent for the batch view vs.
> >> > > >     the web view; thoughts?
> >> > > > - Changes should have lineage preserved
> >> > > >   - Current value is the optimized path
> >> > > >   - Lineage search is the less optimized path
> >> > > > - If HBase is part of a solution:
> >> > > >   - maintain a scan-free solution
> >> > > >   - maintain a coprocessor-free solution
> >> > > >
> >> > > > Most of what I've thought of is something along these lines:
> >> > > >
> >> > > > - Diffs are stored in columns in HBase row(s)
> >> > > >   - row GUID:current would have one column with the current
> >> > > >     representation
> >> > > >   - row GUID:lineage would have an ordered set of columns
> >> > > >     representing the lineage diffs
> >> > > > - Mutable indices are directly updated (e.g. Solr or ES)
> >> > > > - We'd probably want to provide transparent read support
> >> > > >   downstream which supports merging for batch read:
> >> > > >   - a Spark DataFrame
> >> > > >   - a Hive SerDe
> >> > > >
> >> > > > What I'd like to get out of this discussion is an architecture
> >> > > > document with a suggested approach and the necessary JIRAs to
> >> > > > split this up. If anyone has suggestions or comments about any of
> >> > > > this, please speak up. I'd like to actually get this done in the
> >> > > > near term. :)
> >> > > >
> >> > > > Best,
> >> > > >
> >> > > > Casey
> >> > >
> >>
> >> --
> >>
> >> Jon
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC - Apache Metron (Incubating)
> jsirota AT apache DOT org
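[Editor's note: a minimal in-memory sketch, under stated assumptions, of the scan-free row layout Casey proposes: the current value under one row key (`GUID:current`), the lineage diffs under a sibling row key (`GUID:lineage`) as timestamp-ordered columns, so both reads are point gets rather than scans. The key/column formats and the dict standing in for the HBase table are illustrative only.]

```python
import json

def current_row_key(guid):
    return f"{guid}:current"

def lineage_row_key(guid):
    return f"{guid}:lineage"

def lineage_column(ts):
    # Zero-padded timestamps keep the columns in transaction order
    # under lexicographic column sorting (as HBase sorts qualifiers).
    return f"t{ts:020d}"

# In-memory stand-in for the table: {row_key: {column: value}}.
table = {}

def record_edit(guid, ts, diff, new_current):
    """Write one transaction: append the diff to the lineage row and
    overwrite the pre-cached current representation."""
    table.setdefault(lineage_row_key(guid), {})[lineage_column(ts)] = json.dumps(diff)
    table[current_row_key(guid)] = {"v": json.dumps(new_current)}

record_edit("abc-123", 1498500000000,
            {"set": {"is_alert": False}},
            {"guid": "abc-123", "is_alert": False})

# Current value: a single point get, no merge and no scan.
current = json.loads(table["abc-123:current"]["v"])
# Lineage (the less optimized path): a point get on the sibling row,
# with the columns already sorted into transaction order.
ordered = sorted(table["abc-123:lineage"].keys())
```

Because both row keys derive from the GUID, readers and the downstream merge layers (the Spark DataFrame or Hive SerDe mentioned above) can fetch current state and lineage with two point gets, which is what keeps the design scan-free and coprocessor-free.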
