I'm not that familiar with machine learning, but is there potential value in having Druid be a "consumer" of machine learning, such as for optimization purposes?
For example, training a model on a Druid cluster's past queries as part of a query cost estimator.

On Tue, Jan 28, 2020 at 12:39 AM Roman Leventov <leventov...@gmail.com> wrote:

> However, I now see Charles's point -- the data that is typically stored in
> Druid rows is simple and is not something models are typically applied to.
> Timeseries themselves (that is, the results of timeseries queries in
> Druid) may be an input for anomaly detection or phase transition models,
> but there is no point in applying them inside Druid.
>
> One corner case is sketches which are time series, so models could be
> applied to them individually.
>
> On Tue, 28 Jan 2020 at 08:59, Roman Leventov <leventov...@gmail.com> wrote:
>
> > I was thinking about model training on the Druid indexing side and
> > evaluation on the Druid querying side.
> >
> > The advantage Druid has over Spark at query time is faster row filtering
> > thanks to bitset indexes. But since model evaluation is a pretty heavy
> > operation (I suppose; does anyone have ballpark time estimates? How does
> > it compare to a sketch update?), row scanning may not be the bottleneck,
> > and therefore there is no significant reason to use Druid instead of
> > just plugging a Spark engine into Druid segments.
> >
> > On the indexing side, the Druid indexer may be considered a
> > general-purpose job scheduler, so somebody who already has Druid may
> > leverage it instead of setting up a separate Airflow scheduler.
> >
> > On Tue, 28 Jan 2020, 06:46 Charles Allen <cral...@apache.org> wrote:
> >
> > > > it makes more sense to have tooling around Druid, to slice and dice
> > > > the data that you need, and do the ML stuff in sklearn, or even in
> > > > Spark
> > >
> > > I agree with this sentiment. Druid as an execution engine is very good
> > > at doing distributed aggregation (distributed reduce). What advantage
> > > does Druid as an engine have that Spark does not for ML?
> > >
> > > Are you talking about training or model evaluation? Or either?
> > > It *might* be possible to have a likeness mechanism, whereby you can
> > > pass in a model as a filter and aggregate on rows (dimension tuples?)
> > > that match the model by some minimum criterion, but I'm not really
> > > sure what utility that would have. Maybe as a quick backtesting
> > > engine? I feel like I'm a solution searching for a problem going down
> > > this route, though.
> > >
> > > On Mon, Jan 27, 2020 at 12:11 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
> > >
> > > > > Vertica has it. Good idea to introduce it in Druid.
> > > >
> > > > I'm not sure this is a valid argument. With this argument, you can
> > > > introduce anything into Druid. I think it is good to be opinionated,
> > > > and to decide as a community why we do or don't introduce ML
> > > > capabilities into the software.
> > > >
> > > > For example, databases like Postgres and BigQuery allow users to
> > > > build simple regression models:
> > > > https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro. I also
> > > > don't think it is that hard to introduce linear regression using
> > > > gradient descent into Druid:
> > > > https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
> > > > However, how many people are going to use this?
> > > >
> > > > For me, it makes more sense to have tooling around Druid, to slice
> > > > and dice the data that you need, and do the ML stuff in sklearn, or
> > > > even in Spark. For example, using
> > > > https://github.com/druid-io/pydruid or having the ability to use
> > > > Spark to read directly from the deep storage.
> > > >
> > > > Introducing models using SPs or UDFs is also a possibility, but here
> > > > I share Sayat's concerns when it comes to performance and
> > > > scalability.
> > > >
> > > > Cheers, Fokko
> > > >
> > > > On Sat, Jan 25, 2020 at 08:51 Gaurav Bhatnagar <gaura...@gmail.com> wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Vertica has it. Good idea to introduce it in Druid.
> > > > >
> > > > > On Mon, Jan 13, 2020 at 12:52 AM Dusan Maric <thema...@gmail.com> wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > That would be a great idea! Thanks for sharing this.
> > > > > >
> > > > > > I would just like to chime in on Druid + ML model use cases:
> > > > > > predictions and anomaly detection on top of TensorFlow ❤
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > On Fri, Jan 10, 2020 at 6:41 AM Roman Leventov <leventov...@gmail.com> wrote:
> > > > > >
> > > > > > > Hello Druid developers, what do you think about the future of
> > > > > > > Druid & machine learning?
> > > > > > >
> > > > > > > Druid has been great at complex aggregations. Could (should?)
> > > > > > > it make inroads into ML? Perhaps aggregators which apply the
> > > > > > > rows against some pre-trained model and summarize the results.
> > > > > > >
> > > > > > > Should model training stay completely external to Druid, or
> > > > > > > could it be incorporated into Druid's data lifecycle on a
> > > > > > > conceptual level, such as a recurring "indexing" task which
> > > > > > > stores the result (the model) in Druid's deep storage, with
> > > > > > > the model automatically loaded on historical nodes as needed
> > > > > > > (just like segments) and certain aggregators picking up the
> > > > > > > latest model?
> > > > > > >
> > > > > > > Does this make any sense? In what cases will Druid & ML work
> > > > > > > well together, and when should ML stay Spark's prerogative?
> > > > > > >
> > > > > > > I would be very interested to hear any thoughts on the topic,
> > > > > > > vague ideas, and questions.
> > > > > >
> > > > > > --
> > > > > > Dušan Marić
> > > > > > mob.: +381 64 1124779 | e-mail: thema...@gmail.com | skype: themaric
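For reference, the gradient-descent linear regression that Fokko's link describes really is only a few lines. Below is a minimal standalone sketch of the technique (plain Python, nothing Druid-specific; all function and variable names are made up for illustration):

```python
# Batch gradient descent for simple linear regression (y = m*x + b),
# minimizing mean squared error. Illustrative sketch only -- not Druid code.

def step(m, b, xs, ys, lr):
    """Take one gradient-descent step on MSE = (1/n) * sum((m*x + b - y)^2)."""
    n = len(xs)
    # Partial derivatives of MSE with respect to m and b.
    grad_m = (2.0 / n) * sum((m * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = (2.0 / n) * sum((m * x + b - y) for x, y in zip(xs, ys))
    return m - lr * grad_m, b - lr * grad_b

def fit(xs, ys, lr=0.05, iters=2000):
    """Run gradient descent from (m, b) = (0, 0) and return the fitted line."""
    m = b = 0.0
    for _ in range(iters):
        m, b = step(m, b, xs, ys, lr)
    return m, b

if __name__ == "__main__":
    # Points lying exactly on y = 2x + 1; the fit should recover m ~ 2, b ~ 1.
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]
    m, b = fit(xs, ys)
    print(round(m, 2), round(b, 2))  # close to 2.0 and 1.0
```

Whether this belongs inside a Druid aggregator is exactly the open question in this thread; the sketch only shows that the math itself is trivial, not that it scales or is worth building in.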