Hello Paul,

Bringing in the perspective partly of an Arrow developer, but mostly of someone who works quite a lot in Python with the respective data libraries there: in Python, all (performant) data crunching is done on columnar representations. This is partly because columnar is more CPU-efficient for these tasks, but also because columnar data can be abstracted in a form where all computational work is implemented in C/C++ or an LLVM-based JIT, while still keeping clear and understandable interfaces in Python.

In the end, to support Python efficiently, we will always have to convert into a columnar representation. That makes row-wise APIs to a system that is internally columnar quite annoying, since a lot of work is wasted in the conversion layer. If one wanted to support Python UDFs, in most cases the cost of the UDF calls would be greatly dominated by the conversion logic.
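To make the conversion-layer cost concrete, here is a minimal pure-Python sketch of what a client sitting on a row-wise API must do before any columnar library (Pandas, NumPy, Arrow) can take over. The names `row_reader` and `to_columns` are hypothetical, purely for illustration; they are not Drill or Arrow APIs:

```python
def row_reader():
    """Hypothetical row-wise API: yields one Python dict per row,
    so every single value crosses the interpreter boundary as an object."""
    for i in range(1000):
        yield {"id": i, "value": i * 0.5}

def to_columns(rows):
    """The conversion layer a Python client must pay for when the engine
    only exposes rows: transpose row dicts into per-column lists."""
    cols = {"id": [], "value": []}
    for r in rows:            # O(rows * columns) Python-level work
        cols["id"].append(r["id"])
        cols["value"].append(r["value"])
    return cols

cols = to_columns(row_reader())
```

With a columnar API, the engine would hand over `cols` (or, better, raw buffers) directly, and this per-row Python loop, which dwarfs any UDF body, would disappear entirely.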
For the actual performance difference this makes, you can look at the work recently happening in Apache Spark, where Arrow is used to convert results from Spark's internal JVM data structures into typical Python ones (Pandas DataFrames). In comparison to the existing conversion path, this currently shows a speedup of 40x, and it will be even higher once further steps are implemented. Julian should be able to provide a link to slides that outline the work in more detail.

As I'm quite new to Drill, I cannot go into much further detail w.r.t. Drill, but be aware that for languages like Python, having a columnar API really matters. Drill does not integrate with Python as a first-class citizen at the moment, so moving to row-wise APIs probably won't make a difference to the current situation, but good columnar APIs would help us keep the path open for the future.

Uwe

> On 13.06.2017, at 06:11, Paul Rogers <prog...@mapr.com> wrote:
>
> Thanks for the suggestions!
>
> The issue is only partly Calcite changes. The real challenge for potential contributors is that the Drill storage plugin exposes Calcite mechanisms directly. That is, to write a storage plugin, one must know (or, more likely, experiment to learn) the odd set of calls made to the storage plugin: for a group scan, then a sub scan, then this or that. Then, having learned those calls, map what you want to do to them. In some cases, as Calcite chugs along, it calls the same methods multiple times, so the plugin writer has to be prepared to implement caching to avoid banging on the underlying system multiple times for the same data.
>
> The key opportunity here is to observe that the current API is at the implementation level: as callbacks from Calcite. (Though the Drill “easy” storage plugin does hide some of the details.)
> Instead, we’d like an API at the definition level: the plugin simply declares that, say, it can return a schema, or can handle certain kinds of filter push-down, etc.
>
> If we can define that API at the metadata (planning) level, then we can create an adapter between that API and Calcite. Doing so makes it much easier to test the plugin, and isolates the plugin from future code changes as Calcite evolves and improves: the adapter changes, but not the plugin metadata API.
>
> As you suggest, the resulting definition API would be handy to share between projects.
>
> On the execution side, however, Drill plugins are very specific to Drill’s operator framework, Drill’s schema-on-read mechanism, Drill’s special columns (file metadata, partitions), Drill’s vector “mutators”, and so on. Here, any synergy would be with Arrow: defining a common “mutator” API so that a “row batch reader” written for one system would work with the other.
>
> In any case, this kind of sharing is hard to define up front. We might instead keep the discussion going to see what works for Drill, what we can abstract out, and how we can make the common abstraction work for other systems beyond Drill.
>
> Thanks,
>
> - Paul
>
>> On Jun 9, 2017, at 3:38 PM, Julian Hyde <jh...@apache.org> wrote:
>>
>>
>>> On Jun 5, 2017, at 11:59 AM, Paul Rogers <prog...@mapr.com> wrote:
>>>
>>> Similarly, the storage plugin API exposes details of Calcite (which seems to evolve with each new version), exposes value vector implementations, and so on. A cleaner, simpler, more isolated API will allow storage plugins to be built faster, and will also isolate them from Drill internals changes. Without isolation, each change to Drill internals would require plugin authors to update their plugin before Drill can be released.
>>
>> Sorry you’re getting burned by Calcite changes. We try to minimize impact, but sometimes it’s difficult to see what you’re breaking.
>>
>> I like the goal of a stable storage plugin API. Maybe it’s something Drill and Calcite can collaborate on? Much of the DNA of an adapter is independent of the engine that will consume the data (Drill or otherwise): it concerns how to create a connection, get metadata, push down logical operations, and generate queries in the target system’s query language. Calcite and Drill ought to be able to share that part, rather than maintaining separate collections of adapters.
>>
>> Julian
>>
>