There are no existing sinks that are accessible from outside of Apache Beam.
Apache Beam does its work by processing "bundles" (a unit of work). Each bundle is executed and its results are committed back to a runner (such as Flink, Spark, ...). The lifetime of an individual sink instance is bound to the bundle being processed and can be as short as a few milliseconds. Some sinks do cache state within the JVM (such as connections), but the caching is best-effort: if the machine were to go down (crash, autoscaling reducing the number of workers, ...) and ever come back, the cached state must be either unimportant or easily recoverable.

Can Druid handle sinks being created and removed dynamically, and at what rate? What are the expectations around the partial data that a sink has collected?

On Wed, Feb 7, 2018 at 7:31 AM, Charles Allen <charles.al...@snap.com> wrote:

> Hello,
>
> I work closely with druid.io, and one of the main pain points for any
> Druid deployment is handling the real-time streaming component. What I,
> personally, would *like* to have is the streaming orchestration and
> streaming state handled by a runner which specializes in such things,
> allowing Druid to focus on the lightning-fast ad-hoc query side.
>
> A natural contender for such a setup would be a Beam-based solution with a
> Druid segment Sink. The trouble we are having pursuing such a setup is
> twofold:
>
> 1. Many Druid setups run lambda-style pipes to backfill late or wrong
> data, so the sink needs to call out to the Druid cluster for data-version
> locking and orchestration. Access must be available from the sink to the
> Druid Overlord and/or Coordinator, or potentially some other task-specific
> JVM in the cluster.
> 2. Druid segments are queryable while they are being built. This means
> that the Druid cluster (the Broker specifically) must be able to discover
> and issue *RPC queries* against the Sink for the partial data it has
> accumulated.
> This puts an extra dynamic load on the Sink JVM, but in my own
> experience that extra load is small compared to the load of indexing the
> incoming data.
>
> The key reason to want a stream-native framework is that the Druid
> MiddleManager/Peon setup (the streaming-task infrastructure for Druid) has
> problems recovering from failure, recovering easily from upgrades,
> handling late data well, and scaling horizontally on the fly. The Kafka
> Indexing Extension
> <http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html>
> handles some of these things, but still doesn't quite operate like a
> stream-native solution, and is only available for Kafka.
>
> What is not clear to me is whether such a feature set would be within the
> intended scope of Beam, as opposed to having some other streaming
> data-catcher service that Beam simply feeds into (i.e., a Tranquility
> <https://github.com/druid-io/tranquility> sink). Having a customizable,
> accessible, and stateful sink with good commit semantics seems like
> something Beam could support natively, but the "accessible" part is where
> I need guidance on whether that is in scope for Beam.
>
> Can the dev list provide some insight into whether there are any other
> Sinks that strive to be accessible at run-time from non-Beam components?
>
> Also, is such a use case something that should be part of what Beam does,
> or would it be best kept outside of Beam?
>
> Thank you,
> Charles Allen
>
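P.S. To make the bundle lifecycle I described above concrete, here is a minimal plain-Java sketch. The `DruidSegmentSink` class and all of its method names are hypothetical, loosely mirroring Beam's `DoFn` `@StartBundle`/`@ProcessElement`/`@FinishBundle` hooks; it is not a real Beam or Druid API, just an illustration of how short-lived and commit-scoped a sink instance is:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a bundle-scoped sink, loosely modeled on Beam's
// DoFn lifecycle (@StartBundle / @ProcessElement / @FinishBundle).
// "DruidSegmentSink" and its method names are illustrative, not a real API.
class DruidSegmentSink {
    // Best-effort per-bundle buffer; anything here is lost if the worker
    // dies before the runner commits the bundle.
    private final List<String> buffer = new ArrayList<>();
    private boolean committed = false;

    // Called by the runner at the start of each bundle; the sink instance
    // may live only for the few milliseconds the bundle takes to process.
    void startBundle() {
        buffer.clear();
        committed = false;
    }

    // Called once per element in the bundle.
    void process(String row) {
        buffer.add(row);
    }

    // Called when the bundle completes; only at this point are results
    // committed back to the runner. Partial data accumulated before this
    // point has no durability guarantees.
    List<String> finishBundle() {
        committed = true;
        return new ArrayList<>(buffer);
    }

    boolean isCommitted() {
        return committed;
    }
}

public class BundleLifecycleSketch {
    public static void main(String[] args) {
        DruidSegmentSink sink = new DruidSegmentSink();
        sink.startBundle();
        sink.process("event-1");
        sink.process("event-2");
        List<String> segment = sink.finishBundle();
        System.out.println("committed " + segment.size() + " rows");
    }
}
```

The point of the sketch is the question it raises for Druid: between `startBundle()` and `finishBundle()` the buffer is exactly the "partial data" I asked about, and a Broker issuing RPC queries against it would be querying state with no durability guarantees.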