I'll answer the last question first:

Many data groups are processed via Airflow, so having a batch component
compatible with Airflow is more impactful right now than being able to
live stream data. I'm constantly on the lookout for a use case where Druid
streaming is a good fit (as opposed to Graphite/Grafana, or even
potentially Prometheus), but haven't yet found one where the overhead of
maintaining the extra realtime and streaming system is worth the payoff.
From a technology investment point of view, a Beam compatible sink (we have
an internal one based on Tranquility for streaming sinks) might end up
working. I am interested to see if the KIS features can be leveraged to
work with systems outside of Kafka. Also of great interest is whether the
"resources per task" can be made more tunable instead of being a single
cookie cutter footprint. The need for huge resources during the final
merge-and-push phase compared to the incremental intake phase is also a
major pain point and cause of inefficiency for Druid streaming.

Watermarking *could* tell if segments are unavailable (i.e. a whole hour of
data is missing) and fail the query accordingly if the watermark cursor has
not advanced beyond the interval end. I have not attempted to put such an
interrupt into the query layer though. It is a very intriguing idea. In
general, the cursors work by monitoring the segment availability
announcements and watching for certain criteria to be met before advancing.
A very simple example here would be to halt a watermark's progression until
at least *some* data for a time range is available in some segment
somewhere. A more advanced cursor would have a concept of "completeness"
and only advance the watermark once a time range has met some "complete"
criteria (a number of events, or a signal from an external system, could
make sense).
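
To make the two cursor styles above concrete, here is a rough sketch in
Python. It is not the code from the PR and does not use any Druid API: the
function names, the hourly bucketing, and the dict of per-hour event counts
(standing in for segment availability announcements) are all assumptions
made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical model: `announced` maps the start of each hourly bucket to the
# number of events available in announced segments for that hour.

def advance_watermark(watermark, announced, is_ready, step=timedelta(hours=1)):
    """Move the watermark forward one bucket at a time while is_ready holds."""
    while is_ready(watermark, announced):
        watermark += step
    return watermark

def has_any_data(bucket_start, announced):
    """Simple cursor: at least *some* data exists for the bucket."""
    return announced.get(bucket_start, 0) > 0

def make_completeness_check(min_events):
    """Advanced cursor: the bucket counts as complete at min_events events."""
    def is_complete(bucket_start, announced):
        return announced.get(bucket_start, 0) >= min_events
    return is_complete

def interval_is_covered(watermark, interval_end):
    """A query over an interval could be failed when this returns False."""
    return watermark >= interval_end

t0 = datetime(2019, 1, 7, 0, tzinfo=timezone.utc)
hour = timedelta(hours=1)
# Hour 2 is missing entirely, and hour 1 only has a handful of events.
announced = {t0: 100, t0 + hour: 5, t0 + 3 * hour: 100}

any_data_mark = advance_watermark(t0, announced, has_any_data)
complete_mark = advance_watermark(t0, announced, make_completeness_check(50))
```

With this sample data the "any data" watermark stops at the missing hour,
while the "completeness" watermark stops one hour earlier because that hour
has too few events. A query layer check along the lines of
interval_is_covered would then reject intervals ending past either mark.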

This is also nice for automated checks, which can wait until the watermark
has progressed before querying the Druid cluster for some data.
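
As a rough illustration of such an automated check, the Python sketch below
polls a watermark until it passes a required point, and only then lets the
caller query. Everything here is hypothetical: get_watermark stands in for
whatever lookup returns the current watermark, and the timeout and poll
interval are arbitrary defaults rather than anything from the extension.

```python
import time

def wait_for_watermark(get_watermark, required, timeout_s=60.0, poll_s=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the watermark reaches `required` or the timeout expires.

    Returns True once get_watermark() >= required, False on timeout.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if get_watermark() >= required:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)

# Example with a fake watermark source that advances on every poll.
fake_marks = iter([1, 2, 5])
ready = wait_for_watermark(lambda: next(fake_marks), required=4,
                           sleep=lambda seconds: None)
```

A check built this way only issues its query once the function returns
True, and can report "data not ready yet" instead of a misleading result
when it returns False.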

Hopefully that answers some questions,
Charles Allen


On Mon, Jan 7, 2019 at 12:50 PM Gian Merlino <g...@apache.org> wrote:

> For Kafka, maybe something that tells you if all committed data is actually
> loaded, & what offset has been committed up to? Would there be any problems
> caused by the fact that only the most recent commit is saved in the DB?
>
> Is this feature connected at all to an ask I have heard from a few people:
> that there be an option to fail a query (or at least include a special
> response header) if some segments in the interval are unavailable? (Which,
> currently, the broker can't know since it doesn't know details about all
> available segments.)
>
> Btw, at your site do you have any plans to migrate to Kafka indexing?
>
> On Wed, Jan 2, 2019 at 5:37 PM Charles Allen
> <charles.al...@snap.com.invalid> wrote:
>
> > Hi all!
> >
> > https://github.com/apache/incubator-druid/pull/6799
> >
> > A contribution is up that includes a neat feature we have been using
> > internally called Watermarks. Basically when operating a large scale and
> > multi-tenant system, it is handy to be able to monitor how 'well behaved'
> > the data is with regard to history. This is commonly used to spot holes
> > in data, and to help give hints to data consumers in a lambda environment
> > on when data has been run through a thorough check (batch job) vs a best
> > effort sketch of the results which may or may not handle late data well
> > (streaming intake).
> >
> > Unfortunately I'm not really sure what metadata would be handy to have
> > for the Kafka indexing service, so I'd love input there as well if
> > anyone knows of any "watermarks" that would make sense for it.
> >
> > Since the extension was written to be a stand alone service, it can
> > remain as an extension forever if desired. An alternative I would like
> > to propose is that the primitives for the watermark feature be added to
> > core Druid, and the extension points be added to their respective places
> > (MySQL extension and Google extension to name two explicitly).
> >
> > Let me know what you think!
> > Charles Allen
> >
>
