Can you clarify what data stores are at play here?

On June 21, 2017 at 17:07:42, Casey Stella (ceste...@gmail.com) wrote:

Hi All,

I know we've had a couple of these already, but we're due for another
discussion of a sensible approach to mutating indexed data. The motivation
is that users will want to update fields to correct and augment data.
These corrections are invaluable for things like feedback for ML models or
just plain providing better context when evaluating alerts, etc.

Rather than posing a solution, I'd like to pose the characteristics of a
solution and we can fight about those first. ;)

In my mind, the following are the characteristics that I'd look for:

- Changes should be considered additional or replacement fields for
  existing fields
- Changes need to be available in the web view in near real time (on the
  order of milliseconds)
- Changes should be available in the batch view
  - I'd be OK with the batch view being eventually consistent with the
    web view. Thoughts?
- Changes should have lineage preserved
  - Current value is the optimized path
  - Lineage search is the less optimized path
- If HBase is part of a solution
  - maintain a scan-free solution
  - maintain a coprocessor-free solution

Most of what I've thought of is something along these lines:

- Diffs are stored in columns in HBase row(s)
  - row: GUID:current would have one column with the current
    representation
  - row: GUID:lineage would have an ordered set of columns representing
    the lineage diffs
- Mutable indices are updated directly (e.g. Solr or ES)
- We'd probably want to provide transparent read support downstream
  which supports merging for batch read:
  - a Spark dataframe
  - a Hive SerDe
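
To make the row layout above concrete, here is a rough Python sketch. It is
only an illustration under assumptions: a plain dict stands in for the HBase
table, and the names MutationStore, apply_diff, the "doc" column and
merge_for_batch are all hypothetical, not existing APIs. The sequence
numbering is also not safe under concurrent writers; a real table would need
timestamps or an atomic counter.

```python
from typing import Any, Dict, List


class MutationStore:
    """In-memory stand-in for an HBase table: row key -> {column: value}."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, Any]] = {}

    def apply_diff(self, guid: str, diff: Dict[str, Any]) -> Dict[str, Any]:
        """Record a diff in the lineage row and fold it into the current row."""
        current_row = self.rows.setdefault(f"{guid}:current", {})
        lineage_row = self.rows.setdefault(f"{guid}:lineage", {})

        # Zero-padded sequence numbers sort lexicographically, which mimics
        # the sorted column qualifiers of an HBase row. (Assumption: a single
        # writer, so len() is a good enough counter for this sketch.)
        seq = f"{len(lineage_row):010d}"
        lineage_row[seq] = dict(diff)

        # The current row keeps one column holding the merged field overrides.
        merged = dict(current_row.get("doc", {}))
        merged.update(diff)
        current_row["doc"] = merged
        return merged

    def current(self, guid: str) -> Dict[str, Any]:
        """Optimized path: one point lookup by row key, no scan."""
        return dict(self.rows.get(f"{guid}:current", {}).get("doc", {}))

    def lineage(self, guid: str) -> List[Dict[str, Any]]:
        """Less optimized path: one point lookup, then ordered columns."""
        row = self.rows.get(f"{guid}:lineage", {})
        return [row[key] for key in sorted(row)]


def merge_for_batch(original: Dict[str, Any], store: MutationStore,
                    guid: str) -> Dict[str, Any]:
    """Overlay current overrides on the immutable batch record at read time."""
    merged = dict(original)
    merged.update(store.current(guid))
    return merged
```

Both reads are point gets on a key derived from the GUID, so nothing here
needs a scan or a coprocessor. A Spark dataframe or Hive SerDe wrapper would
do the equivalent of merge_for_batch for each record it reads.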

What I'd like to get out of this discussion is an architecture document
with a suggested approach and the necessary JIRAs to split this up. If
anyone has suggestions or comments about any of this, please speak up. I'd
like to actually get this done in the near-term. :)

Best,

Casey
