Hi Uwe,

This is incredibly helpful information! Your explanation makes perfect sense.
We work quite a bit with ODBC and JDBC: two interfaces that are very much synchronous and row-based. There are three key challenges when working with Drill:

* Drill results are columnar, requiring a column-to-row translation for xDBC.
* Drill uses an asynchronous API, while JDBC and ODBC are synchronous, requiring an async-to-sync API translation.
* The JDBC driver is based on the Drill client, which pulls in quite a bit (almost all, really) of Drill's code.

The thought is to create a new API that serves the needs of ODBC and JDBC, but without the complexity (while, of course, preserving the existing client for other uses). Said another way: find a way to keep the xDBC interfaces simple so that they don't take quite so much space in the client, and don't require quite so much work to maintain.

The first issue (row vs. columnar) turns out not to be a huge one: the columnar-to-row translation code exists and works. The real issue is allowing the client to control the size of the data sent from the server. (At present, the server decides the "batch" size, and sometimes the size is huge.) So, we can just focus on controlling batch size (and thus client buffer allocations), but retain the columnar form, even for ODBC and JDBC.

So, for the Pandas use case, does your code allow (or benefit from) multiple simultaneous queries over the same connection? Or, since Python seems to be only approximately multi-threaded, would a synchronous, columnar API work better? Here I just mean: within a single connection, is there a need to run multiple concurrent queries, or is the classic one-concurrent-query-per-connection model easier for Python to consume?

Another point you raise is that our client-side column format should be Arrow, or Arrow-compatible. (That is, either using Arrow code, or the same data format as Arrow.) That way users of your work can easily leverage Drill. This last point raises an interesting issue that I (at least) need to understand more clearly.
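To make the columnar-to-row and client-controlled-batch-size ideas concrete, here is a minimal sketch in plain Python. All names here (`ColumnarBatch`, `RowCursor`, `fetch_batch`) are hypothetical illustrations, not Drill's actual client API: the client asks the server for at most N rows per batch, keeps the data columnar on the wire, and pivots to rows only at the xDBC surface.

```python
# Hypothetical sketch (not Drill's API): a row-wise, xDBC-style cursor over
# columnar batches whose size is bounded by the CLIENT, not the server.

class ColumnarBatch:
    """A result batch stored column-by-column, as a columnar engine produces it."""
    def __init__(self, columns):
        self.columns = columns                      # dict: column name -> list of values
        self.names = list(columns)
        self.num_rows = len(next(iter(columns.values()), []))

class RowCursor:
    """Row-wise (JDBC/ODBC-style) view over a stream of columnar batches."""
    def __init__(self, fetch_batch, max_batch_rows=1024):
        self.fetch_batch = fetch_batch              # callable: max_rows -> ColumnarBatch or None
        self.max_batch_rows = max_batch_rows        # client-chosen buffer bound
        self._batch = None
        self._row = 0

    def fetch_one(self):
        # Pull the next batch (of client-bounded size) when the current one is done.
        while self._batch is None or self._row >= self._batch.num_rows:
            self._batch = self.fetch_batch(self.max_batch_rows)
            self._row = 0
            if self._batch is None:
                return None                         # end of result set
        # Column-to-row pivot: gather value i from every column.
        row = tuple(self._batch.columns[name][self._row] for name in self._batch.names)
        self._row += 1
        return row

def fake_server(columns):
    """Simulated server: holds full columnar results, slices on client demand."""
    pos, total = 0, len(next(iter(columns.values()), []))
    def fetch(max_rows):
        nonlocal pos
        if pos >= total:
            return None
        sliced = {k: v[pos:pos + max_rows] for k, v in columns.items()}
        pos += max_rows
        return ColumnarBatch(sliced)
    return fetch

cursor = RowCursor(fake_server({"id": [1, 2, 3], "name": ["a", "b", "c"]}),
                   max_batch_rows=2)
rows = []
while (r := cursor.fetch_one()) is not None:
    rows.append(r)
print(rows)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

The point of the sketch is only the shape of the contract: the server stays columnar end to end, and the row pivot plus the batch-size bound both live in the thin client layer.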
Is Arrow a data format plus code? Or is the data format one aspect of Arrow, and the implementation another? It would be great to have a common data format, but as we squeeze ever more performance from Drill, we find we have to very carefully tune our data manipulation code for the specific needs of Drill queries. I wonder how we'd do that if we switched to using Arrow's generic vector implementation code? Has anyone else wrestled with this question for your project?

Thanks,

- Paul

> On Jun 15, 2017, at 12:23 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>
> Hello Paul,
>
> Bringing in a bit of perspective, partly as an Arrow developer but mostly as someone who works quite a lot in Python with the respective data libraries there: in Python, all (performant) data crunching is done on columnar representations. This is partly because columnar is more CPU-efficient for these tasks, but also because columnar can be abstracted in a form where all computational work is implemented in C/C++ or an LLVM-based JIT while still keeping clear and understandable interfaces in Python. In the end, to support Python efficiently, we will always have to convert into a columnar representation, which makes row-wise APIs to a system that is internally columnar quite annoying, as we get a lot of waste in the conversion layer. If one wanted to support Python UDFs, this would lead to a situation where, in most cases, the UDF calls are greatly dominated by the conversion logic.
>
> For the actual performance difference this makes, you can look at the work recently happening in Apache Spark, where Arrow is used to convert results from Spark's internal JVM data structures into typical Python ones ("Pandas DataFrames"). Compared to the existing conversion, this currently sees a speedup of 40x, but it will be even higher once further steps are implemented.
> Julien should be able to provide a link to slides that outline the work better.
>
> As I'm quite new to Drill, I cannot go into much further detail w.r.t. Drill, but be aware that for languages like Python, having a columnar API really matters. While Drill does not currently integrate with Python as a first-class citizen, moving to row-wise APIs probably won't make a difference to the current situation, but good columnar APIs would help keep the path open for the future.
>
> Uwe
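Uwe's conversion-overhead point can be illustrated in plain Python (no Arrow dependency assumed, and no claim about Drill's actual wire format): a row-wise API forces every single value through Python-level pivot code before a Pandas-style (columnar) library can use it, while a columnar API delivers data already in the target layout, so the "conversion" is essentially a handoff of the column buffers.

```python
# Illustration only: why row-wise delivery is wasteful for columnar consumers.

rows = [(1, "a"), (2, "b"), (3, "c")]      # what a row-wise API hands to Python

# Row-wise path: a per-value pivot loop runs in the Python interpreter for
# every cell of the result set before Pandas-style code can touch it.
pivoted = {"id": [], "name": []}
for id_, name in rows:
    pivoted["id"].append(id_)
    pivoted["name"].append(name)

# Columnar path: the delivered layout already matches what columnar libraries
# consume, so there is no per-value Python work at all.
columnar = {"id": [1, 2, 3], "name": ["a", "b", "c"]}

assert pivoted == columnar
print(pivoted)  # {'id': [1, 2, 3], 'name': ['a', 'b', 'c']}
```

The per-value loop is the "wastage in the conversion layer" Uwe describes; with real data volumes (and real Arrow buffers rather than Python lists), that difference is what drives speedups like the Spark result he cites.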