Hello Paul,

See my comments inline.

Keep in mind that I'm new to Drill and still at the stage of only using it 
experimentally. Still, we have quite a lot of relational DB usage in our 
workflows, so the same problems will apply to Drill, as the interfaces are all 
quite similar. While I use DBs/query engines for a variety of problems, the 
main workflow I need columnar data structures for is machine learning model 
development / usage: join different tables & data sets; apply transformations 
using SQL; ingest this data; prepare it further in Python; feed it to a 
machine learning pipeline.

> Am 15.06.2017 um 19:39 schrieb Paul Rogers <prog...@mapr.com>:
> 
> Hi Uwe,
> 
> This is incredibly helpful information! Your explanation makes perfect sense.
> 
> We work quite a bit with ODBC and JDBC: two interfaces that are very much 
> synchronous and row-based.

I have not yet dealt with the ODBC API itself directly, only with abstractions 
on top of it, but from what I have heard it should support a mode where 
results are returned in a columnar fashion. We have therefore developed 
https://github.com/blue-yonder/turbodbc to have a columnar ODBC interface in 
Python / Pandas. Before that, we often had the problem that exporting CSV from 
the DB and then parsing the CSV was much faster than the Python ODBC 
libraries. There are some quite efficient CSV parsers around that are orders 
of magnitude faster than pure-Python row-to-column conversion code.
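
To make it concrete what "columnar" means on the Python side, here is a 
minimal sketch of the turbodbc usage (the DSN "mydb" and the table are made 
up):

    import turbodbc

    connection = turbodbc.connect(dsn="mydb")
    cursor = connection.cursor()
    cursor.execute("SELECT id, amount FROM sales")
    # Instead of a list of row tuples, the result arrives as one NumPy
    # (masked) array per column, keyed by column name.
    columns = cursor.fetchallnumpy()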

> There are three key challenges with working with Drill:
> 
> * Drill results are columnar, requiring a column-to-row translation for xDBC
> * Drill uses an asynchronous API, while JDBC and ODBC are synchronous, 
> resulting in an async-to-sync API translation.
> * The JDBC API is based on the Drill client which requires quite a bit 
> (almost all, really) of Drill code.
> 
> The thought is to create a new API that serves the need of ODBC and JDBC, but 
> without the complexity (while, of course, preserving the existing client for 
> other uses.) Said another way, find a way to keep the xDBC interfaces simple 
> so that they don’t take quite so much space in the client, and don’t require 
> quite so much work to maintain.
> 
> The first issue (row vs. columnar) turns out not to be a huge issue; the 
> columnar-to-row translation code exists and works. The real issue is allowing 
> the client to control the size of the data sent from the server. (At present, the 
> server decides the “batch” size, and sometimes the size is huge.) So, we can 
> just focus on controlling batch size (and thus client buffer allocations), 
> but retain the columnar form, even for ODBC and JDBC.
> 
> So, for the Pandas use case, does your code allow (or benefit from) multiple 
> simultaneous queries over the same connection? Or, since Python seems to be 
> only approximately multi-threaded, would a synchronous, columnar API work 
> better? Here I just mean, in a single connection, is there a need to run 
> multiple concurrent queries, or is the classic 
> one-concurrent-query-per-connection model easier for Python to consume?

This greatly depends on the workflow. I could imagine that for applications 
where you start a large query that only produces a few rows as a result, it 
would be worthwhile to start several of them at once and let the client wait 
asynchronously for all of them. For me, as a data scientist, most queries 
return a rather large result set. I would typically join data sources together 
and do some transformations on the data with SQL, but in the end I would have 
a single table as the result, which is then fed to a longer-running machine 
learning pipeline. For this, multiple connections or multiple queries are not 
relevant; due to the size of the result, the time spent on serialization 
between ODBC, Pandas/Arrow and Python structures is the dominant time 
component. In contrast to the query, which may scale over a large number of 
workers, the serialization is done on a single worker at the end.

> Another point you raise is that our client-side column format should be 
> Arrow, or Arrow-compatible. (That is, either using Arrow code, or the same 
> data format as Arrow.) That way users of your work can easily leverage Drill.

Recently we have added an Apache Arrow interface to Turbodbc: 
http://arrow.apache.org/blog/2017/06/16/turbodbc-arrow/
While that still gets the data via the traditional ODBC interface and needs 
some transformations, the data that is passed from C++ to Python is structured 
in a way that doesn't need to be transformed again in Python (which is the 
typical problem with traditional Python DB interfaces that return row-wise 
Python objects as results). Having a client that would already return Arrow 
structures instead of the ODBC ones would save at least one memory copy.
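
Usage-wise this looks roughly like the following sketch (DSN and query made 
up); the pyarrow.Table that comes back can be handed over to Pandas without 
building per-row Python objects:

    import turbodbc

    connection = turbodbc.connect(dsn="mydb")
    cursor = connection.cursor()
    cursor.execute("SELECT id, amount FROM sales")
    table = cursor.fetchallarrow()  # pyarrow.Table, assembled in C++
    df = table.to_pandas()          # columnar handoff to Pandas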

> This last question raises an interesting issue that I (at least) need to 
> understand more clearly. Is Arrow a data format + code? Or, is the data 
> format one aspect of Arrow, and the implementation another? Would be great to 
> have a common data format, but as we squeeze ever more performance from 
> Drill, we find we have to very carefully tune our data manipulation code for 
> the specific needs of Drill queries. I wonder how we’d do that if we switched 
> to using Arrow’s generic vector implementation code? Has anyone else wrestled 
> with this question for your project?

Arrow is at its core a data format / specification for an in-memory 
representation of columnar data. In addition, it also brings Java / C++ / 
Python / C-GLib / JavaScript implementations of these data structures and 
helper functions to construct them. As I'm only involved in the C++ / Python 
side of things, I cannot really tell much detail about the Java 
implementation, but as it was forked out of Drill, it should still be very 
close to the ValueVectors. As a C++ consumer, it only matters to me that the 
format is respected, not which specific (code) implementation is used on the 
producer side.
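
As a small illustration of the "format plus implementations" point, this is 
how the Python implementation builds such columnar structures; any producer 
(Java, C++, ...) that respects the format would be an equally valid source of 
the underlying buffers (sketch, column names made up):

    import pyarrow as pa

    # Two Arrow arrays: the columnar in-memory representation,
    # including null handling (the None in "amounts").
    ids = pa.array([1, 2, 3], type=pa.int64())
    amounts = pa.array([9.5, None, 3.25], type=pa.float64())

    # Group them into a record batch with a schema.
    batch = pa.RecordBatch.from_arrays([ids, amounts], ["id", "amount"])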

Main takeaways: 
 * My queries produce rather large results (1-10 million rows)
 * These results are consumed by a single worker. If there were more consuming 
workers, the query results would scale with the number of workers, i.e. each 
worker would still receive 1-10 million rows.
 * Serialization / transformation cost in the client code is my main concern. 
To use Python effectively on large data, I need columnar data; a row-wise API 
may be the simpler interface but is very costly in the end.
 * If we extended this to UDFs in Python, they would have the same 
serialization/transformation problem as the client code, just on a smaller 
number of rows (see the toy comparison below).
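
To make the last point a bit more concrete, a toy comparison of row-wise vs. 
columnar processing on the client side (NumPy only, made-up data):

    import numpy as np

    values = np.random.rand(1000000)

    # Row-wise style: one Python-level call and one boxed float per row.
    doubled_rows = [v * 2.0 for v in values]

    # Columnar style: a single vectorized call over the whole column,
    # without creating per-row Python objects.
    doubled_column = values * 2.0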

Uwe

> 
> Thanks,
> 
> - Paul
> 
> 
>> On Jun 15, 2017, at 12:23 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>> 
>> Hello Paul,
>> 
>> Bringing in a bit of the perspective partly of an Arrow developer but mostly 
>> of someone who works quite a lot in Python with the respective data libraries 
>> there: in Python, all (performant) data crunching work is done on columnar 
>> representations. While this is partly because columnar is more CPU efficient 
>> for these tasks, it is also because columnar can be abstracted in a form 
>> where all computational work is implemented in C/C++ or an LLVM-based JIT 
>> while still keeping clear and understandable interfaces in Python. In the 
>> end, to support Python efficiently, we will always have to convert into a 
>> columnar representation, which makes row-wise APIs to a system that is 
>> internally columnar quite annoying, as there is a lot of wastage in the 
>> conversion layer. If one wanted to provide the ability to support Python 
>> UDFs, the UDF calls would in most cases be greatly dominated by the 
>> conversion logic.
>> 
>> For the actual performance difference this makes, you can have a look at 
>> the work that has recently been happening in Apache Spark, where Arrow is 
>> used to convert results from Spark's internal JVM data structures into 
>> typical Python ones ("Pandas DataFrames"). In comparison to the existing 
>> conversion, this currently shows a speedup of 40x and will be even higher 
>> once further steps are implemented. Julien should be able to provide a link 
>> to slides that outline the work better.
>> 
>> As I'm quite new to Drill, I cannot go into much further detail w.r.t. 
>> Drill, but be aware that for languages like Python, having a columnar API 
>> really matters. While Drill does not really integrate with Python as a 
>> first-class citizen at the moment, moving to row-wise APIs probably won't 
>> make a difference to the current situation, but good columnar APIs would 
>> help us keep the path open for the future.
>> 
>> Uwe
> 
