It is clear to me that we need an independently stored transaction log that is decoupled from any of our existing systems. So Simon's idea of storing the transaction logs in HBase and referencing them via a global ID resonates with me. I like it for the following reasons:
- It makes Metron more pluggable, both for adding additional data storage
  sources (for example, a graph database) and for disabling existing ones.
- It makes enforcing consistency of data between data sources easier. Each
  data storage system can be pointed at the transaction log, so when a user
  modifies data in system X and the change is recorded in the transaction
  log, systems Y and Z can listen for that change and adjust their data
  accordingly, based on the global ID. A rough sketch of appending to such
  a log follows below.
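Here is that sketch, using the plain HBase client. The table name, column
family, and the reverse-timestamp qualifier scheme are all illustrative,
not a proposal for a final API:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TransactionLogWriter {
  // Illustrative table and column family names, not a final schema.
  private static final byte[] CF = Bytes.toBytes("t");

  // Append one patch to the log for a message, keyed by its global ID.
  // The qualifier is (Long.MAX_VALUE - timestamp), so a single Get on the
  // row returns the patches newest-first, with no scans involved.
  public static void logPatch(Connection conn, String guid, String patchJson)
      throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("metron_txn_log"))) {
      Put put = new Put(Bytes.toBytes(guid));
      put.addColumn(CF,
                    Bytes.toBytes(Long.MAX_VALUE - System.currentTimeMillis()),
                    patchJson.getBytes(StandardCharsets.UTF_8));
      table.put(put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      // "some-guid" stands in for a real METRON-765 global id.
      logPatch(conn, "some-guid",
               "{\"op\":\"replace\",\"field\":\"is_alert\",\"value\":true}");
    }
  }
}

Systems Y and Z would then tail this table (or a Kafka topic fed from the
same write path) and apply the patch to their own copy of the record.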
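On Justin's point below about ES only supporting the retrieve/mutate/reindex
dance: with the METRON-765 guid as the document id, that dance is small. A
minimal sketch against the ES TransportClient API; index and type names are
purely illustrative:

import java.util.Map;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.Client;

public class EsPatchService {
  // Retrieve the doc by guid, mutate one field, and reindex it under the
  // same id. Reusing the id means we update the existing doc rather than
  // creating a similar-but-unlinked sibling. index/docType are assumed
  // names; guid is the METRON-765 global id.
  public static void patch(Client client, String index, String docType,
                           String guid, String field, Object newValue) {
    GetResponse current = client.prepareGet(index, docType, guid).get();
    if (!current.isExists()) {
      throw new IllegalStateException("No document for guid " + guid);
    }
    Map<String, Object> source = current.getSource();
    source.put(field, newValue);
    client.prepareIndex(index, docType, guid).setSource(source).get();
  }
}

This is exactly the kind of thing that belongs behind the service layer
Justin describes, so nothing downstream is hardcoded to ES.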
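Casey's GUID:current / GUID:lineage row layout below also stays scan-free on
the read side, since both the optimized path and the lineage path are single
Gets. A sketch, where the "metron_updates" table and the "d" column family
are hypothetical names:

import java.util.NavigableMap;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LineageReader {
  // Hypothetical table and column family names.
  private static final byte[] CF = Bytes.toBytes("d");

  // Optimized path: one Get on the GUID:current row for the merged doc.
  public static byte[] readCurrent(Connection conn, String guid)
      throws Exception {
    try (Table t = conn.getTable(TableName.valueOf("metron_updates"))) {
      Result r = t.get(new Get(Bytes.toBytes(guid + ":current")));
      return r.getValue(CF, Bytes.toBytes("doc"));
    }
  }

  // Less-optimized path: one Get on the GUID:lineage row. HBase returns
  // columns sorted by qualifier, which is what gives us the ordered set
  // of lineage diffs.
  public static NavigableMap<byte[], byte[]> readLineage(Connection conn,
                                                         String guid)
      throws Exception {
    try (Table t = conn.getTable(TableName.valueOf("metron_updates"))) {
      Result r = t.get(new Get(Bytes.toBytes(guid + ":lineage")));
      return r.getFamilyMap(CF);
    }
  }
}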
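Finally, on Justin's HDFS worry and Simon's shc suggestion below: the batch
view could be the HDFS data left-joined against the HBase updates,
preferring the updated value where one exists. A sketch under those
assumptions; the shc catalog, table, and field names are illustrative, and
only a single mutable field is shown:

import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MergedBatchView {
  // Hypothetical shc catalog for the updates table: "guid" is the row key
  // and "is_alert" is one mutable column.
  private static final String CATALOG = "{"
      + "\"table\":{\"namespace\":\"default\",\"name\":\"metron_updates\"},"
      + "\"rowkey\":\"key\","
      + "\"columns\":{"
      + "\"guid\":{\"cf\":\"rowkey\",\"col\":\"key\",\"type\":\"string\"},"
      + "\"is_alert\":{\"cf\":\"d\",\"col\":\"is_alert\",\"type\":\"boolean\"}"
      + "}}";

  public static Dataset<Row> mergedView(SparkSession spark, String hdfsPath) {
    Dataset<Row> base = spark.read().json(hdfsPath);
    Dataset<Row> updates = spark.read()
        .format("org.apache.spark.sql.execution.datasources.hbase")
        .option("catalog", CATALOG)
        .load()
        .withColumnRenamed("is_alert", "is_alert_upd");
    // Left join on guid; rows without an update keep their original value.
    return base
        .join(updates, base.col("guid").equalTo(updates.col("guid")),
              "left_outer")
        .withColumn("is_alert", coalesce(col("is_alert_upd"), col("is_alert")))
        .drop(updates.col("guid"))
        .drop("is_alert_upd");
  }
}

This is the part I would expect to hide behind the transparent dataframe /
Hive SerDe layer Casey mentions, so end users never see the join.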
Thanks,
James

22.06.2017, 14:09, "Justin Leet" <[email protected]>:
> Thanks, Jon, that looks like it should work for the key. I didn't realize
> that guid got handled that way, which makes life much easier there. Almost
> like we already needed to identify messages or something. At that point we
> should be good, since we can easily retrieve, update, and put on it.
>
> We'll also need to make sure any long-term storage solution uses it.
>
> On Thu, Jun 22, 2017 at 12:52 PM, [email protected] <[email protected]> wrote:
>
>> The key should be a solved problem as of METRON-765
>> <https://github.com/apache/metron/commit/27b0d6e31de94317b085766349a892395f0d3309>,
>> right? It provides a single key for a given message that is globally
>> stored with the message, regardless of where/how.
>>
>> Jon
>>
>> On Thu, Jun 22, 2017 at 9:01 AM Justin Leet <[email protected]> wrote:
>>
>> > First off, I agree with the characteristics.
>> >
>> > For the data stores, we'll need to make sure we can actually handle the
>> > collapsing of the updates into a single view. Casey mentioned making
>> > the long-term stores transparent, but there's potentially work for the
>> > near real-time stores: we need to make sure we actually do updates,
>> > rather than create new docs that aren't linked to the old ones. This
>> > should be entirely transparent and handled by a service layer, rather
>> > than anything hardcoded to a datastore.
>> >
>> > For ES at least, the only way to do this is to retrieve the doc, mutate
>> > it, and then reindex it (even the update API does that dance under the
>> > hood for you, and since we're potentially doing non-trivial changes we
>> > might need to manage it ourselves). This implies the existence of a
>> > key, even if one isn't enforced by ES (which I don't believe it will
>> > be). We need to be able to grab the doc(s?) to be updated, not end up
>> > with similar ones that shouldn't be mutated. I assume this is also true
>> > (at least in the generalities) of Solr.
>> >
>> > In concert with your other thread, couldn't part of this key end up
>> > being metadata (either user-defined or environment-defined)? For
>> > example, in a situation where customer id is applied as metadata, it's
>> > possible that two customers feed off the same datasource but need to
>> > mutate it independently. At that point, we have metadata that is
>> > effectively part of the key. We don't want to update both docs, but
>> > there's no real way to distinguish them. Maybe that's something we push
>> > off for the short term, but it seems potentially nontrivial.
>> >
>> > In terms of consistency, I'd definitely agree that the long-term
>> > storage can be eventually consistent. Any type of bulk spelunking,
>> > Spark jobs, dashboarding, etc. shouldn't need up-to-the-millisecond
>> > data.
>> >
>> > Basically, I'm thinking the real-time store is the snapshot of current
>> > state, and the long-term store is the full record, complete with the
>> > lineage history.
>> >
>> > I'm also interested in people's opinions on how we want to manage HDFS.
>> > Assuming we do use HBase to store our updates, that means every HDFS op
>> > has to join onto that HBase table to get any updates that HDFS is
>> > missing (unless we implement some writeback and merge for HDFS data).
>> > I'm worried that our two datastores are really ES and HDFS+HBase, and
>> > that keeping that data actually synced for end users is going to be
>> > painful.
>> >
>> > Justin
>> >
>> > On Wed, Jun 21, 2017 at 10:18 PM, Simon Elliston Ball <
>> > [email protected]> wrote:
>> >
>> > > I'd say that was an excellent set of requirements (very similar to
>> > > the one we arrived at in the last discuss thread on this).
>> > >
>> > > My vote remains a transaction log in HBase. Given the relatively low
>> > > volume (human scale), I would not expect this to need anything fancy
>> > > like compaction into HDFS state, but that does make a good argument
>> > > for a long-term dataframe solution for Spark, with a short-term
>> > > stopgap using a joined data frame and shc.
>> > >
>> > > Simon
>> > >
>> > > Sent from my iPhone
>> > >
>> > > > On 22 Jun 2017, at 05:11, Otto Fowler <[email protected]> wrote:
>> > > >
>> > > > Can you clarify what data stores are at play here?
>> > > >
>> > > > On June 21, 2017 at 17:07:42, Casey Stella ([email protected]) wrote:
>> > > >
>> > > > Hi All,
>> > > >
>> > > > I know we've had a couple of these already, but we're due for
>> > > > another discussion of a sensible approach to mutating indexed data.
>> > > > The motivation for this is that users will want to update fields to
>> > > > correct and augment data. These corrections are invaluable for
>> > > > things like feedback for ML models, or just plain providing better
>> > > > context when evaluating alerts, etc.
>> > > >
>> > > > Rather than posing a solution, I'd like to pose the characteristics
>> > > > of a solution and we can fight about those first. ;)
>> > > >
>> > > > In my mind, the following are the characteristics that I'd look for:
>> > > >
>> > > > - Changes should be considered additional or replacement fields for
>> > > >   existing fields
>> > > > - Changes need to be available in the web view in near real time
>> > > >   (on the order of milliseconds)
>> > > > - Changes should be available in the batch view
>> > > >   - I'd be ok with eventually consistent with the web view; thoughts?
>> > > > - Changes should have lineage preserved
>> > > >   - Current value is the optimized path
>> > > >   - Lineage search is the less optimized path
>> > > > - If HBase is part of a solution
>> > > >   - maintain a scan-free solution
>> > > >   - maintain a coprocessor-free solution
>> > > >
>> > > > Most of what I've thought of is something along these lines:
>> > > >
>> > > > - Diffs are stored in columns in HBase row(s)
>> > > >   - row GUID:current would have one column with the current
>> > > >     representation
>> > > >   - row GUID:lineage would have an ordered set of columns
>> > > >     representing the lineage diffs
>> > > > - Mutable indices are directly updated (e.g. Solr or ES)
>> > > > - We'd probably want to provide transparent read support downstream
>> > > >   which supports merging for batch read:
>> > > >   - a Spark dataframe
>> > > >   - a Hive SerDe
>> > > >
>> > > > What I'd like to get out of this discussion is an architecture
>> > > > document with a suggested approach and the necessary JIRAs to split
>> > > > this up.
>> > > > If anyone has suggestions or comments about any of this, please
>> > > > speak up. I'd like to actually get this done in the near-term. :)
>> > > >
>> > > > Best,
>> > > >
>> > > > Casey
>>
>> --
>>
>> Jon

-------------------
Thank you,

James Sirota
PPMC - Apache Metron (Incubating)
jsirota AT apache DOT org
