I'll answer the last question first: many data groups are processed via Airflow, so a batch component compatible with Airflow is more impactful right now than the ability to live-stream data. I'm constantly on the lookout for a use case where Druid streaming is a good fit (as opposed to Graphite/Grafana, or even potentially Prometheus), but I haven't yet found one where the overhead of maintaining the extra realtime and streaming system is worth the payoff. From a technology investment point of view, a Beam-compatible sink might end up working (we have an internal one for streaming, based on Tranquility). I am interested to see whether the KIS features can be leveraged to work with systems outside of Kafka. Also of great interest is whether the "resources per task" can be made more tunable instead of being a single cookie-cutter footprint. The need for huge resources during the final merge-and-push phase, compared to the incremental intake phase, is also a major pain point and a cause of inefficiency for Druid streaming.
Watermarking *could* tell whether segments are unavailable (i.e. a whole hour of data is missing) and fail the query accordingly if the watermark cursor has not advanced beyond the interval end. I have not attempted to put such an interrupt into the query layer, though. It is a very intriguing idea. In general, the cursors work by monitoring the segment availability announcements and watching for certain criteria to be met before advancing. A very simple example would be to halt a watermark's progression until at least *some* data for a time range is available in some segment somewhere. A more advanced cursor would have a concept of "completeness" and only advance the watermark once a time range has met some "complete" criterion (a number of events, or a signal from an external system, could make sense). Another nice thing is that automated checks can wait until the watermark has progressed before querying the Druid cluster for data.

Hopefully that answers some questions,
Charles Allen

On Mon, Jan 7, 2019 at 12:50 PM Gian Merlino <g...@apache.org> wrote:

> For Kafka, maybe something that tells you if all committed data is actually
> loaded, & what offset has been committed up to? Would there be any problems
> caused by the fact that only the most recent commit is saved in the DB?
>
> Is this feature connected at all to an ask I have heard from a few people:
> that there be an option to fail a query (or at least include a special
> response header) if some segments in the interval are unavailable? (Which,
> currently, the broker can't know since it doesn't know details about all
> available segments.)
>
> Btw, at your site do you have any plans to migrate to Kafka indexing?
>
> On Wed, Jan 2, 2019 at 5:37 PM Charles Allen <charles.al...@snap.com.invalid>
> wrote:
>
> > Hi all!
> >
> > https://github.com/apache/incubator-druid/pull/6799
> >
> > A contribution is up that includes a neat feature we have been using
> > internally called Watermarks. Basically when operating a large scale and
> > multi-tenant system, it is handy to be able to monitor how 'well behaved'
> > the data is with regard to history. This is commonly used to spot holes in
> > data, and to help give hints to data consumers in a lambda environment on
> > when data has been run through a thorough check (batch job) vs a best
> > effort sketch of the results which may or may not handle late data well
> > (streaming intake).
> >
> > Unfortunately I'm not really sure what meta-data would be handy to have for
> > the kafka indexing service, so I'd love input there as well if anyone knows
> > of any "watermarks" that would make sense for it.
> >
> > Since the extension was written to be a stand alone service, it can remain
> > as an extension forever if desired. An alternative I would like to propose
> > is that the primitives for the watermark feature be added to core druid,
> > and the extension points be added to their respective places (mysql
> > extension and google extension to name two explicitly).
> >
> > Let me know what you think!
> > Charles Allen
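
P.S. To make the cursor description in my reply above a bit more concrete, here is a rough sketch of the two cursor flavors (the "some data" one and the "completeness" one). This is not the code from the PR. The class name `WatermarkCursor`, its methods, and the idea that an announcement carries an event count are all assumptions made up for illustration, and it is Python rather than Java just to keep it short.

```python
from datetime import datetime, timedelta


class WatermarkCursor:
    """Hypothetical cursor: the watermark is the end of the contiguous run of
    time buckets (starting at `start`) that satisfy some criterion."""

    def __init__(self, start, granularity, is_satisfied):
        self.watermark = start            # everything before this passed the check
        self.granularity = granularity    # bucket width, e.g. one hour
        self.is_satisfied = is_satisfied  # callable(event_count) -> bool
        self.counts = {}                  # bucket start -> events announced so far

    def on_segment_announced(self, bucket_start, num_events):
        # Assumption: each availability announcement tells us which bucket the
        # segment covers and how many events it holds.
        self.counts[bucket_start] = self.counts.get(bucket_start, 0) + num_events
        self._advance()

    def _advance(self):
        # Move forward only across buckets that have data AND pass the check;
        # the first hole (or incomplete bucket) halts the watermark.
        while (self.watermark in self.counts
               and self.is_satisfied(self.counts[self.watermark])):
            self.watermark += self.granularity

    def covers(self, interval_end):
        # A query layer or automated check could refuse to run until this is True.
        return self.watermark >= interval_end


hour = timedelta(hours=1)
t0 = datetime(2019, 1, 7, 0)

any_data = WatermarkCursor(t0, hour, lambda n: n > 0)     # "some data" cursor
complete = WatermarkCursor(t0, hour, lambda n: n >= 100)  # "completeness" cursor

for cursor in (any_data, complete):
    cursor.on_segment_announced(t0, 150)
    cursor.on_segment_announced(t0 + 2 * hour, 150)  # hour 1 is still a hole
    cursor.on_segment_announced(t0 + hour, 40)       # hole filled, but thin
```

With these announcements the "some data" cursor ends up past hour 2, while the "completeness" cursor stays parked at the start of hour 1 because 40 events is under its made-up threshold of 100. A check that calls `covers(interval_end)` before querying would therefore wait on the second cursor but not the first.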