When we're talking about a "transaction log", an edit could involve
multiple deletions/additions, so are we proposing storing a diff to the
JSON map as the representation of a particular transaction?  I proposed
pre-caching the current value to lessen the burden on the reader (i.e. not
having to merge the transactions into the current state). What do we think
of that?

Also, I want to ensure we maintain a solution that is scan-free: the edits
should exist as separate columns rather than separate rows in the NoSQL
store.
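
A minimal sketch of that layout (plain Python dicts standing in for an HBase
table; the class, method, and column names here are all hypothetical, not
Metron code): each transaction becomes its own column qualifier on the
message's row, and a pre-cached "current" column spares readers from
replaying the log:

```python
# Hypothetical sketch: one row per message GUID, each edit stored as a
# separate column qualifier, so reads are single-row gets -- no scans.
import json

class TransactionLog:
    def __init__(self):
        self.rows = {}  # rowkey -> {column qualifier: value}, stands in for HBase

    def apply_edit(self, guid, txn_id, patch):
        row = self.rows.setdefault(guid, {"current": "{}"})
        # Each transaction lands in its own column: "txn:<id>" -> JSON diff.
        row["txn:%d" % txn_id] = json.dumps(patch)
        # Pre-cache the merged value so readers never replay the log.
        current = json.loads(row["current"])
        for field, value in patch.items():
            if value is None:
                current.pop(field, None)  # a None in the diff means "delete field"
            else:
                current[field] = value
        row["current"] = json.dumps(current)

    def get_current(self, guid):
        # Reader's fast path: one column read, no merging.
        return json.loads(self.rows[guid]["current"])

log = TransactionLog()
log.apply_edit("guid-1", 1, {"ip_src_addr": "10.0.0.1", "is_alert": True})
log.apply_edit("guid-1", 2, {"is_alert": None, "threat_score": 80})
print(log.get_current("guid-1"))  # {'ip_src_addr': '10.0.0.1', 'threat_score': 80}
```

Reads stay scan-free: both the current value and the full edit history live
on a single row, so one get serves either path.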

Thoughts?

On Mon, Jun 26, 2017 at 5:36 PM, James Sirota <[email protected]> wrote:

> It is clear to me that we need an independently-stored transaction log
> that is de-coupled from any of our existing systems.  So Simon’s idea of
> storing the transaction logs in HBase and being able to reference them via
> a global ID resonates with me.  I like it for the following reasons:
>
> - It makes Metron more pluggable as far as adding additional sources for
> data storage (for example a graph database) as well as disabling existing
> data sources.
>
> - It makes enforcing consistency of data between data sources easier.
> Each data storage system can be pointed at the transaction log, so when a
> user modifies data in system X and the change gets recorded in the
> transaction log, systems Y and Z can listen for it and adjust their data
> accordingly based on the global ID.
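
To illustrate the listening pattern James describes (purely a sketch; the
class names are invented and nothing here is a real Metron or HBase API),
downstream stores can subscribe to the log and converge on the same state
keyed by global ID:

```python
# Invented names throughout: a toy pub/sub transaction log that fans
# changes out to every registered store, keyed by global ID.
class TxnLog:
    def __init__(self):
        self.listeners = []

    def subscribe(self, listener):
        self.listeners.append(listener)

    def record(self, global_id, change):
        # System X writes here; Y and Z hear about it and converge.
        for listener in self.listeners:
            listener.on_change(global_id, change)

class Store:
    def __init__(self, name):
        self.name = name
        self.docs = {}

    def on_change(self, global_id, change):
        self.docs.setdefault(global_id, {}).update(change)

log = TxnLog()
es, graph = Store("elasticsearch"), Store("graph-db")
log.subscribe(es)
log.subscribe(graph)
log.record("guid-42", {"is_alert": False})
# Both stores now hold the same view of guid-42 without talking to each other.
```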
>
> Thanks, James
>
>
> 22.06.2017, 14:09, "Justin Leet" <[email protected]>:
> > Thanks, Jon, that looks like it should work for the key. I didn't realize
> > that guid got handled that way, which makes life much easier there.
> > Almost like we already needed to identify messages or something. At that
> > point we should be good, since we can easily retrieve, update, and put
> > with it.
> >
> > We'll also need to make sure any long term storage solution also uses it.
> >
> > On Thu, Jun 22, 2017 at 12:52 PM, [email protected] <[email protected]>
> wrote:
> >
> >>  The key should be a solved problem as of METRON-765
> >>  <https://github.com/apache/metron/commit/27b0d6e31de94317b085766349a892395f0d3309>,
> >>  right? It provides a single key for a given message that is globally
> >>  stored with the message, regardless of where or how.
> >>
> >>  Jon
> >>
> >>  On Thu, Jun 22, 2017 at 9:01 AM Justin Leet <[email protected]>
> wrote:
> >>
> >>  > First off, I agree with the characteristics.
> >>  >
> >>  > For the data stores, we'll need to make sure we can actually handle
> >>  > collapsing the updates into a single view. Casey mentioned making the
> >>  > long-term stores transparent, but there's potentially work for the
> >>  > near-real-time stores: we need to make sure we actually do updates,
> >>  > rather than create new docs that aren't linked to the old ones. This
> >>  > should be entirely transparent and handled by a service layer, rather
> >>  > than anything hardcoded to a datastore.
> >>  >
> >>  > For ES at least, the only way to do this is to retrieve the doc,
> >>  > mutate it, and then reindex it (even the update API does that dance
> >>  > under the hood for you, and since we're potentially making non-trivial
> >>  > changes we might need to manage it ourselves). This implies the
> >>  > existence of a key, even if one isn't enforced by ES (which I don't
> >>  > believe it is). We need to be able to grab the doc(s?) to be updated,
> >>  > not end up with similar ones that shouldn't be mutated. I assume this
> >>  > is also true (at least in its generalities) of Solr.
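
A sketch of that get/mutate/reindex dance (a plain dict stands in for the
index, and the helper name is invented, not an ES client API):

```python
# Hypothetical helper illustrating keyed retrieve -> mutate -> reindex.
def reindex_update(index, key, mutation):
    """Update a document in place by key, never creating a near-duplicate."""
    doc = index[key]   # retrieve: fails loudly if the key is unknown
    doc = dict(doc)    # work on a copy; this becomes the new version of the doc
    mutation(doc)      # apply the (possibly non-trivial) change
    index[key] = doc   # reindex under the *same* key, replacing the old doc
    return doc

index = {"guid-7": {"source.type": "bro", "is_alert": True}}
reindex_update(index, "guid-7", lambda d: d.update({"is_alert": False}))
# Still exactly one doc for guid-7, now with the corrected field.
```

The important property is that the update lands under the same key, so we
mutate the existing doc instead of minting a similar one alongside it.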
> >>  >
> >>  > In concert with your other thread, couldn't part of this key end up
> >>  > being metadata (either user-defined or environment-defined)? For
> >>  > example, in a situation where a customer ID is applied as metadata,
> >>  > it's possible two customers feed off the same data source but may
> >>  > need to mutate independently. At that point, we have metadata that
> >>  > effectively forms part of the key. We don't want to update both docs,
> >>  > but there's no real way to distinguish them. And maybe that's
> >>  > something we push off for the short term, but it seems potentially
> >>  > nontrivial.
> >>  >
> >>  > In terms of consistency, I'd definitely agree that the long-term
> >>  > storage can be eventually consistent. Any type of bulk spelunking,
> >>  > Spark jobs, dashboarding, etc. shouldn't need up-to-the-millisecond
> >>  > data.
> >>  >
> >>  > Basically, I'm thinking the real time store is the snapshot of
> current
> >>  > state, and the long term store is the full record complete with the
> >>  lineage
> >>  > history.
> >>  >
> >>  > I'm also interested in people's opinions on how we want to manage
> >>  > HDFS. Assuming we do use HBase to store our updates, that means every
> >>  > HDFS op has to join onto that HBase table to get any updates that
> >>  > HDFS is missing (unless we implement some writeback and merge for
> >>  > HDFS data). I'm worried that our two datastores are really ES and
> >>  > HDFS+HBase, and that keeping that data actually synced for end users
> >>  > is going to be painful.
> >>  >
> >>  > Justin
> >>  >
> >>  >
> >>  > On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <
> >>  > [email protected]> wrote:
> >>  >
> >>  > > I'd say that was an excellent set of requirements (very similar to
> >>  > > the ones we arrived at in the last discuss thread on this).
> >>  > >
> >>  > > My vote remains a transaction log in HBase. Given the relatively
> >>  > > low volume (human scale), I would not expect this to need anything
> >>  > > fancy like compaction into HDFS state, but that does make a good
> >>  > > argument for a long-term DataFrame solution for Spark, with a
> >>  > > short-term stopgap using a joined data frame and SHC.
> >>  > >
> >>  > > Simon
> >>  > >
> >>  > > Sent from my iPhone
> >>  > >
> >>  > > > On 22 Jun 2017, at 05:11, Otto Fowler <[email protected]>
> >>  wrote:
> >>  > > >
> >>  > > > Can you clarify what data stores are at play here?
> >>  > > >
> >>  > > >
> >>  > > > On June 21, 2017 at 17:07:42, Casey Stella ([email protected])
> >>  wrote:
> >>  > > >
> >>  > > > Hi All,
> >>  > > >
> >>  > > > I know we've had a couple of these already, but we're due for
> >>  > > > another discussion of a sensible approach to mutating indexed
> >>  > > > data. The motivation for this is that users will want to update
> >>  > > > fields to correct and augment data. These corrections are
> >>  > > > invaluable for things like feedback for ML models, or just plain
> >>  > > > providing better context when evaluating alerts, etc.
> >>  > > >
> >>  > > > Rather than posing a solution, I'd like to pose the
> characteristics
> >>  of
> >>  > a
> >>  > > > solution and we can fight about those first. ;)
> >>  > > >
> >>  > > > In my mind, the following are the characteristics that I'd look
> >>  > > > for:
> >>  > > >
> >>  > > > - Changes should be considered additional or replacement fields
> >>  > > >   for existing fields
> >>  > > > - Changes need to be available in the web view in near real time
> >>  > > >   (on the order of milliseconds)
> >>  > > > - Changes should be available in the batch view
> >>  > > >   - I'd be ok with eventually consistent with the web view,
> >>  > > >     thoughts?
> >>  > > > - Changes should have lineage preserved
> >>  > > >   - Current value is the optimized path
> >>  > > >   - Lineage search is the less optimized path
> >>  > > > - If HBase is part of a solution
> >>  > > >   - maintain a scan-free solution
> >>  > > >   - maintain a coprocessor-free solution
> >>  > > >
> >>  > > > Most of what I've thought of is something along the lines of:
> >>  > > >
> >>  > > > - Diffs are stored in columns in HBase row(s)
> >>  > > >   - row GUID:current would have one column with the current
> >>  > > >     representation
> >>  > > >   - row GUID:lineage would have an ordered set of columns
> >>  > > >     representing the lineage diffs
> >>  > > > - Mutable indices are directly updated (e.g. Solr or ES)
> >>  > > > - We'd probably want to provide transparent read support
> >>  > > >   downstream which supports merging for batch reads:
> >>  > > >   - a Spark dataframe
> >>  > > >   - a Hive SerDe
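
That two-row layout could be sketched roughly like this (dicts standing in
for HBase rows, with made-up row keys and qualifiers): the lineage row holds
ordered diff columns, and the current row is just the cached full replay:

```python
# Hypothetical sketch of the GUID:current / GUID:lineage split.
def replay(lineage_row, up_to=None):
    """Rebuild a point-in-time view by folding the ordered lineage diffs."""
    state = {}
    for txn_id in sorted(lineage_row):  # column qualifiers sort by txn id
        if up_to is not None and txn_id > up_to:
            break
        state.update(lineage_row[txn_id])
    return state

table = {
    "guid-9:lineage": {1: {"is_alert": True}, 2: {"threat_score": 50},
                       3: {"threat_score": 90}},
}
# The "current" row is just the full replay, cached at write time,
# so the optimized path never touches the lineage columns.
table["guid-9:current"] = {"value": replay(table["guid-9:lineage"])}

print(table["guid-9:current"]["value"])          # the optimized path
print(replay(table["guid-9:lineage"], up_to=2))  # the less-optimized lineage path
```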
> >>  > > >
> >>  > > > What I'd like to get out of this discussion is an architecture
> >>  document
> >>  > > > with a suggested approach and the necessary JIRAs to split this
> up.
> >>  If
> >>  > > > anyone has suggestions or comments about any of this, please
> speak
> >>  up.
> >>  > > I'd
> >>  > > > like to actually get this done in the near-term. :)
> >>  > > >
> >>  > > > Best,
> >>  > > >
> >>  > > > Casey
> >>  > >
> >>  >
> >>  --
> >>
> >>  Jon
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>
