Hello,

I work closely with druid.io and one of the main pain points for any Druid
deployment is handling the real-time streaming component. What I,
personally, would *like* to have is the streaming orchestration and
streaming state handled by a runner which specializes in such things, and
allow Druid to focus on the lightning fast ad-hoc query side.

A natural contender for such a setup would be a Beam based solution with a
Druid segment Sink. The trouble we are having pursuing such a setup is two
fold:

   1.  Many Druid setups run lambda-style pipes to backfill late or wrong
   data, so the sink needs to call out to the Druid cluster for data version
   locking and orchestration. Access must be available from the sink to the
   Druid Overlord and/or Coordinator, or potentially some other task-specific
   jvm in the cluster.
   2. Druid segments are queryable while they are being built. This means
   that the Druid cluster (Broker specifically) must be able to discover and
   issue *RPC queries* against the Sink on the partial data it has
   accumulated. This puts an extra dynamic load on the Sink jvm, but in my own
   experience that extra load is small compared to the load of indexing the
   incoming data.

The key desire to use a stream-native framework is that the Druid
MiddleManager/Peon setup (the streaming task infrastructure for Druid) has
problems recovering from failure, recovering from upgrade easily, handling
late data well, and dynamic horizontal scaling. The Kafka Indexing Extension
<http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html>
handles some of these things, but still doesn't quite operate like a
stream-native solution, and is only available for Kafka.

What is not clear to me is if such a feature set would be in the intended
scope of Beam as opposed to having some other streaming data catcher
service that Beam simply feeds into (aka, a Tranquility
<https://github.com/druid-io/tranquility> sink). Having a customizable,
accessible and stateful sink with good commit semantics seems like it
should be something Beam could support natively, but the "accessible" part
is where I need some guidance on if that is in scope for Beam.

Can the dev list provide some insight to if there are any other Sinks that
strive to be accessible at run-time from non-Beam components?

Also, is such a use case something that is desired to be a part of what
Beam does, or would it be best outside of Beam?

Thank you,
Charles Allen

Reply via email to