Charles, 
All great questions. I only have answers to some of them :) 
- People have struggled with reasoning about the eventual consistency model 
at least since NoSQL has been around. To the best of my knowledge, no "gold 
standard" has emerged. In a system that maintains approximate information 
anyway, the question of whether consistency is required is a fair one. One 
could imagine scenarios like outlier detection, in which outliers get flagged 
in one dataset but are not corroborated in real time by related data in 
related datasets, due to data races. 
- As for the alternatives, Omid is an Apache Incubator project that has been 
used extensively at Yahoo/Oath, and is being integrated with Apache Phoenix 
these days. Phoenix contends for the same spot as Druid (real-time analytics). 
It supports standard SQL semantics, including ACID transactions. Omid is 
highly scalable and lock-free - data ingestion does not block analytics. It 
employs data multi-versioning built into HBase. For Druid, it would not be 
hard to introduce a similar mechanism into IncrementalIndex; we can have that 
discussion later on if the current one bears fruit. Here's some background on 
Omid: https://www.usenix.org/system/files/conference/fast17/fast17-shacham.pdf 
- and we have improved it further since then.

Best,
Edward
PS +Ohad Shacham (Omid committer and co-owner)
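PS2: To make the multi-versioning point above concrete, here is a rough,
hypothetical sketch (not Omid's or Druid's actual code; all names are made
up) of per-key version chains with snapshot reads - the kind of mechanism one
could graft onto something like IncrementalIndex. Writers append a new
version under a commit timestamp, and readers pin a snapshot timestamp, so
ingestion never blocks analytics:

import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of MVCC-style version chains.
public class VersionedFacts {
  // key -> (commit timestamp -> value)
  private final Map<String, NavigableMap<Long, Object>> versions =
      new ConcurrentHashMap<>();
  private final AtomicLong clock = new AtomicLong();

  // Writers never overwrite in place; they append a new version.
  public long write(String key, Object value) {
    long commitTs = clock.incrementAndGet();
    versions.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>())
            .put(commitTs, value);
    return commitTs;
  }

  // Readers pin a snapshot once and use it for the whole scan.
  public long snapshot() {
    return clock.get();
  }

  // A read sees the latest version at or below its snapshot, ignoring
  // anything committed afterwards - concurrent ingestion stays invisible.
  public Object read(String key, long snapshotTs) {
    NavigableMap<Long, Object> chain = versions.get(key);
    if (chain == null) {
      return null;
    }
    Map.Entry<Long, Object> entry = chain.floorEntry(snapshotTs);
    return entry == null ? null : entry.getValue();
  }
}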
On Wednesday, May 30, 2018, 9:27:39 AM PDT, Charles Allen
<[email protected]> wrote:

 To throw in my two cents.

A lot of things in Druid trade accuracy for speed via approximations.
There are things like aliasing effects from the scatter-gather in
high-cardinality topN queries, as well as HyperLogLog approximations for
cardinality estimation, that favor speed over accuracy.

To go along with this, I am interested in hearing more about the use cases
and seeing whether there are things that can be done to achieve the desired
results, ideally without impacting performance across the board.

Examples of things that I'm curious about:

  - Who is consuming this data? Is it a machine? Is it a human decision
  maker?
  - What is the expectation on data coming back while a transaction is in
  progress? Should the result just be eventually consistent? Should reads
  block or fail until the transaction is complete? Does the data *really*
  need to be read during or shortly after a transaction, or can an "old"
  state be used and just be updated eventually?
  - Is there a progression of a "usually immutable" watermark, or is data
  always considered ad-hoc mutable? Is there a time component to the
  mutability, e.g. data is mutable within the last 24 hours but usually not
  after that? With how Druid is currently architected, this would tie into
  an expectation of TTL for when you can read your writes. (A rough sketch
  of this idea follows the list.)
  - Is there some other project (Apache or otherwise) that solves such a
  problem?
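As a strawman for the watermark question above, here is a rough sketch
(purely hypothetical, not anything Druid does today; the class name is made
up) of a time-based mutability policy where only rows inside a configurable
recent window accept updates:

import java.time.Duration;
import java.time.Instant;

// Hypothetical "mutable only within the last N hours" policy.
class MutabilityWatermark {
  private final Duration mutableWindow;

  MutabilityWatermark(Duration mutableWindow) {
    this.mutableWindow = mutableWindow;
  }

  // Everything time-stamped before the watermark is treated as immutable.
  Instant watermark(Instant now) {
    return now.minus(mutableWindow);
  }

  boolean isMutable(Instant rowTimestamp, Instant now) {
    return rowTimestamp.isAfter(watermark(now));
  }
}

// e.g. new MutabilityWatermark(Duration.ofHours(24))
//          .isMutable(rowTimestamp, Instant.now())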

Cheers,
Charles Allen


On Wed, May 30, 2018 at 8:52 AM Edward Bortnikov <[email protected]>
wrote:

>  Hi Gian,
>
> Thanks for the explanation.
> So, the community does not envision traditional OLTP/RT analytics use
> cases like "read A-compute-write B", cross-partition consistent scans, or
> atomic updates of multiple indexes? The reason I'm asking is that we also
> work on the Omid transaction processor for Hadoop, which could be adapted
> for Druid if the need arises.
> Thanks,
> Edward
>
>    On Tuesday, May 29, 2018, 7:09:27 PM PDT, Gian Merlino <
> [email protected]> wrote:
>
>  Hi Edward,
>
> There are a couple of ways to do transactional updates to Druid today:
>
> 1) When using streaming ingestion with the "parseBatch" feature
> (see io.druid.data.input.impl.InputRowParser), rows in the batch are
> inserted transactionally.
> 2) When using batch ingestion in overwrite mode (the default), all
> operations are transactional. Of course, you must be willing to overwrite
> an entire time interval in this mode.
>
> With (1) scans are not transactionally consistent.
>
> With (2) scans of a particular interval _are_ transactionally consistent,
> due to the nature of how we handle overwrite-style ingestion (the new
> segment set has a higher version number, and queries will use either the
> older or newer version).
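>
> To make (2) a bit more concrete, here is a rough, simplified sketch (not
> Druid's actual segment-management code; all names here are made up) of the
> version-shadowing idea: for each interval, a query considers only the
> segment set carrying the highest version, so it atomically sees either the
> whole old set or the whole new set, never a mix:
>
> import java.util.List;
> import java.util.Map;
> import java.util.stream.Collectors;
>
> // Hypothetical illustration only.
> class Segment {
>   final String interval;  // e.g. "2018-05-29/2018-05-30"
>   final String version;   // a higher version shadows lower ones
>
>   Segment(String interval, String version) {
>     this.interval = interval;
>     this.version = version;
>   }
> }
>
> class VersionShadowing {
>   // Keep, per interval, only the segments carrying the highest version.
>   static List<Segment> visibleSegments(List<Segment> all) {
>     Map<String, String> maxVersion = all.stream().collect(
>         Collectors.toMap(s -> s.interval, s -> s.version,
>             (a, b) -> a.compareTo(b) >= 0 ? a : b));
>     return all.stream()
>         .filter(s -> s.version.equals(maxVersion.get(s.interval)))
>         .collect(Collectors.toList());
>   }
> }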
>
> However Druid never offers read-your-writes consistency. There is always
> some delay (however small) between when you trigger an insert and when that
> insert is actually readable.
>
> On Tue, May 29, 2018 at 4:38 PM, Edward Bortnikov <[email protected]>
> wrote:
>
> > Hi, Community,
> > Do we have any existing or perceived use cases of transactional
> > (multi-row) updates to Druid? What about transactionally consistent
> > scans?
> > Thanks, Edward
>
  
