Hi Andy, Paul,

I would think that the machine learning "pipeline" would be a great use case for this. From my experience, Spark is not the easiest tool for data manipulation, especially if you have complex data pipelines. This is where Drill can really excel, so my thought is that if you are building a model with multiple diverse (and discrete) data sources, it would probably be a lot easier for the analyst to do the data gathering and prep using Drill and then have those results consumed by Spark.
As for the other way around, I could see it being useful to include the output of an ML model in a Drill query, i.e., the model writes out its results, perhaps with a primary key of sorts, and then Drill reads them and joins them with the original data. I think this is pretty much what you're saying, but this is a very valid use case. My experience is that Spark will work well with conventional data, but the second you get out of that realm, it becomes very difficult to deal with.

-- C

> On Jan 13, 2020, at 8:06 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
>
> Thanks Andy!
>
> Very helpful. You have hit on one of the questions that we've been wrestling
> with: which tools would consume Drill data as Arrow? More generally, what are
> the use cases for Arrow data interchange?
>
> Flight makes sense for transferring large data sets, such as in exchanges
> within a distributed engine, or from a "data service" such as a hypothetical
> Flight-based S3 Select. Flight (and Arrow in general) seems less useful as a
> client API for things like BI tools, dashboards and the like; xDBC seems like
> a better fit since such tools will consume "human-sized" result sets.
>
> The article in your link notes that there is a Spark consumer for Flight.
> Drill's use case would likely be similar -- both tools could consume large
> data sets from Flight-enabled sources.
>
> As for Drill as a producer, one could conjure an example in which Spark reads
> data from Drill. Maybe Drill runs a number of complex SQL queries to produce
> data sets upon which Spark runs some ML tasks. Drill is probably a better
> tool to run the kind of monster SQL statements that business analysts like to
> create, but Spark is better for the kind of algorithmic processing typical of
> ML. (One could argue, with Flight, you get the best of both worlds. Charles,
> we need your insight here.) Perhaps Flight's creators have similar scenarios
> in mind.
>
> More practically, between the example flight server you mentioned (as a
> producer) and Spark (as a consumer), we have what we need if someone wants to
> create the prototypes we mentioned.
>
> Or, if someone wants to get very meta, we can have Drill using Flight to read
> from another Drill. Not sure it's useful, but would be a cool demo.
>
> Thanks,
>
> - Paul
>
>
>
> On Monday, January 13, 2020, 04:21:29 PM PST, Andy Grove
> <andygrov...@gmail.com> wrote:
>
> Hi Paul,
>
> There is a test flight server in the Arrow Java project [1] that might be a
> good starting point, although I haven't used it myself. I was looking at
> Arrow Flight for my Ballista Poc [2] although I don't really have time to
> spend on that right now.
>
> I'm less sure of the value of having an Arrow consumer for Drill since any
> vectorized processing would already have been performed by Drill? I may be
> missing something though.
>
> Thanks,
>
> Andy.
>
> [1]
> https://github.com/apache/arrow/tree/master/java/flight/flight-core#example-usage
> [2] https://github.com/andygrove/ballista