Hi Uwe,

This is incredibly helpful information! Your explanation makes perfect sense.
We work quite a bit with ODBC and JDBC: two interfaces that are very much synchronous and row-based. There are three key challenges when working with Drill:

* Drill results are columnar, requiring a column-to-row translation for xDBC.
* Drill uses an asynchronous API, while JDBC and ODBC are synchronous, requiring an async-to-sync API translation.
* The JDBC driver is based on the Drill client, which pulls in quite a bit (almost all, really) of Drill's code.

The thought is to create a new API that serves the needs of ODBC and JDBC, but without the complexity (while, of course, preserving the existing client for other uses). Said another way: find a way to keep the xDBC interfaces simple so that they don't take quite so much space in the client, and don't require quite so much work to maintain.

The first issue (row vs. columnar) turns out not to be a huge one: the columnar-to-row translation code exists and works. The real issue is allowing the client to control the size of the data sent from the server. (At present, the server decides the "batch" size, and sometimes the size is huge.) So, we can just focus on controlling batch size (and thus client buffer allocations), but retain the columnar form, even for ODBC and JDBC.

So, for the Pandas use case, does your code allow (or benefit from) multiple simultaneous queries over the same connection? Or, since Python seems to be only approximately multi-threaded, would a synchronous, columnar API work better? Here I just mean: within a single connection, is there a need to run multiple concurrent queries, or is the classic one-concurrent-query-per-connection model easier for Python to consume?

Another point you raise is that our client-side column format should be Arrow, or Arrow-compatible. (That is, either using Arrow code, or the same data format as Arrow.) That way users of your work can easily leverage Drill. This last point raises an interesting issue that I (at least) need to understand more clearly.
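To make the columnar-to-row and client-controlled-batch-size ideas concrete, here is a minimal sketch in plain Python. All names here (`ColumnarBatch`, `RowCursor`, `fetch_batch`) are hypothetical illustrations, not Drill's actual client API: the client asks the server for at most N rows per batch, keeps the data columnar on the wire, and pivots to rows only at the xDBC surface.

```python
# Hypothetical sketch (not Drill's API): a row-wise, xDBC-style cursor over
# columnar batches whose size is bounded by the CLIENT, not the server.

class ColumnarBatch:
    """A result batch stored column-by-column, as a columnar engine produces it."""
    def __init__(self, columns):
        self.columns = columns                      # dict: column name -> list of values
        self.names = list(columns)
        self.num_rows = len(next(iter(columns.values()), []))

class RowCursor:
    """Row-wise (JDBC/ODBC-style) view over a stream of columnar batches."""
    def __init__(self, fetch_batch, max_batch_rows=1024):
        self.fetch_batch = fetch_batch              # callable: max_rows -> ColumnarBatch or None
        self.max_batch_rows = max_batch_rows        # client-chosen buffer bound
        self._batch = None
        self._row = 0

    def fetch_one(self):
        # Pull the next batch (of client-bounded size) when the current one is done.
        while self._batch is None or self._row >= self._batch.num_rows:
            self._batch = self.fetch_batch(self.max_batch_rows)
            self._row = 0
            if self._batch is None:
                return None                         # end of result set
        # Column-to-row pivot: gather value i from every column.
        row = tuple(self._batch.columns[name][self._row] for name in self._batch.names)
        self._row += 1
        return row

def fake_server(columns):
    """Simulated server: holds full columnar results, slices on client demand."""
    pos, total = 0, len(next(iter(columns.values()), []))
    def fetch(max_rows):
        nonlocal pos
        if pos >= total:
            return None
        sliced = {k: v[pos:pos + max_rows] for k, v in columns.items()}
        pos += max_rows
        return ColumnarBatch(sliced)
    return fetch

cursor = RowCursor(fake_server({"id": [1, 2, 3], "name": ["a", "b", "c"]}),
                   max_batch_rows=2)
rows = []
while (r := cursor.fetch_one()) is not None:
    rows.append(r)
print(rows)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

The point of the sketch is only the shape of the contract: the server stays columnar end to end, and the row pivot plus the batch-size bound both live in the thin client layer.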
Is Arrow a data format plus code? Or is the data format one aspect of Arrow, and the implementation another? It would be great to have a common data format, but as we squeeze ever more performance from Drill, we find we have to very carefully tune our data manipulation code for the specific needs of Drill queries. I wonder how we'd do that if we switched to using Arrow's generic vector implementation code? Has anyone else wrestled with this question for your project?

Thanks,

- Paul

> On Jun 15, 2017, at 12:23 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>
> Hello Paul,
>
> Bringing in a bit of perspective, partly as an Arrow developer but mostly as someone who works quite a lot in Python with the respective data libraries there: in Python, all (performant) data crunching is done on columnar representations. This is partly because columnar is more CPU-efficient for these tasks, but also because columnar can be abstracted in a form where all computational work is implemented in C/C++ or an LLVM-based JIT while still keeping clear and understandable interfaces in Python. In the end, to support Python efficiently, we will always have to convert into a columnar representation, which makes row-wise APIs to a system that is internally columnar quite annoying, as we get a lot of waste in the conversion layer. If one wanted to support Python UDFs, this would lead to a situation where, in most cases, the UDF calls are greatly dominated by the conversion logic.
>
> For the actual performance difference this makes, you can look at the work recently happening in Apache Spark, where Arrow is used to convert results from Spark's internal JVM data structures into typical Python ones ("Pandas DataFrames"). Compared to the existing conversion, this currently sees a speedup of 40x, but it will be even higher once further steps are implemented.
> Julien should be able to provide a link to slides that outline the work better.
>
> As I'm quite new to Drill, I cannot go into much further detail w.r.t. Drill, but be aware that for languages like Python, having a columnar API really matters. While Drill does not currently integrate with Python as a first-class citizen, moving to row-wise APIs probably won't make a difference to the current situation, but good columnar APIs would help keep the path open for the future.
>
> Uwe
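Uwe's conversion-overhead point can be illustrated in plain Python (no Arrow dependency assumed, and no claim about Drill's actual wire format): a row-wise API forces every single value through Python-level pivot code before a Pandas-style (columnar) library can use it, while a columnar API delivers data already in the target layout, so the "conversion" is essentially a handoff of the column buffers.

```python
# Illustration only: why row-wise delivery is wasteful for columnar consumers.

rows = [(1, "a"), (2, "b"), (3, "c")]      # what a row-wise API hands to Python

# Row-wise path: a per-value pivot loop runs in the Python interpreter for
# every cell of the result set before Pandas-style code can touch it.
pivoted = {"id": [], "name": []}
for id_, name in rows:
    pivoted["id"].append(id_)
    pivoted["name"].append(name)

# Columnar path: the delivered layout already matches what columnar libraries
# consume, so there is no per-value Python work at all.
columnar = {"id": [1, 2, 3], "name": ["a", "b", "c"]}

assert pivoted == columnar
print(pivoted)  # {'id': [1, 2, 3], 'name': ['a', 'b', 'c']}
```

The per-value loop is the "wastage in the conversion layer" Uwe describes; with real data volumes (and real Arrow buffers rather than Python lists), that difference is what drives speedups like the Spark result he cites.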