[ 
https://issues.apache.org/jira/browse/DRILL-4791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051042#comment-16051042
 ] 

Paul Rogers commented on DRILL-4791:
------------------------------------

From Uwe Korn, on the dev mailing list:

{quote}
Bringing in a bit of perspective, partly as an Arrow developer but mostly as 
someone who works quite a lot in Python with its data libraries: in Python, 
all (performant) data-crunching work is done on columnar representations. 
This is partly because columnar is more CPU-efficient for these tasks, but 
also because columnar data can be abstracted in a form that lets you implement 
all computational work in C/C++ or with an LLVM-based JIT while still keeping 
clear and understandable interfaces in Python. In the end, to support Python 
efficiently, we will always have to convert into a columnar representation. 
That makes row-wise APIs to an internally columnar system quite annoying, as 
a lot of work is wasted in the conversion layer. If one wanted to support 
Python UDFs, most UDF calls would end up greatly dominated by the conversion 
logic.

For the actual performance difference this makes, have a look at the recent 
work in Apache Spark, where Arrow is used to convert results from Spark's 
internal JVM data structures into typical Python ones ("Pandas DataFrames"). 
Compared to the existing conversion, this currently shows a speedup of 40x, 
and it will be even higher once further steps are implemented. Julien should 
be able to provide a link to slides that outline the work better.

As I'm quite new to Drill, I cannot go into much further detail about Drill 
itself, but be aware that for languages like Python, having a columnar API 
really matters. Drill does not currently integrate with Python as a 
first-class citizen, so moving to row-wise APIs probably won't change the 
current situation, but good columnar APIs would help us keep the path open 
for the future.
{quote}
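
To make the "wastage in the conversion layer" concrete, here is a minimal 
Python sketch. It is not Drill or Arrow code: the dict-of-lists batch layout 
and the helper names columns_to_rows and rows_to_columns are assumptions made 
up for illustration. It shows a columnar batch flattened into rows, as a 
row-wise API would hand it out, and then transposed back by a columnar 
consumer.

```python
from typing import Any, Dict, List, Tuple

# Assumed layout: a batch maps each column name to a list of values.
Columns = Dict[str, List[Any]]


def columns_to_rows(columns: Columns) -> List[Tuple[Any, ...]]:
    """Row-wise view of a columnar batch (what a row-based API hands out)."""
    return list(zip(*columns.values()))


def rows_to_columns(names: List[str], rows: List[Tuple[Any, ...]]) -> Columns:
    """Transpose rows back into columns (what a columnar consumer must redo)."""
    if not rows:
        return {name: [] for name in names}
    return {name: list(col) for name, col in zip(names, zip(*rows))}


batch = {"id": [1, 2, 3], "price": [9.5, 3.0, 7.25]}

# A row-wise API forces two transpositions before any real work happens.
rows = columns_to_rows(batch)
round_trip = rows_to_columns(list(batch), rows)

# With a columnar API the consumer can operate on the column directly.
total = sum(batch["price"])
```

Both transpositions touch every value once without computing anything, which 
is the overhead a columnar client API would avoid.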

Response from Paul:

{quote}
This is incredibly helpful information! Your explanation makes perfect sense.

We work quite a bit with ODBC and JDBC: two interfaces that are very much 
synchronous and row-based. There are three key challenges in working with 
Drill:

* Drill results are columnar, requiring a column-to-row translation for xDBC
* Drill uses an asynchronous API, while JDBC and ODBC are synchronous, 
resulting in an async-to-sync API translation.
* The JDBC API is based on the Drill client which requires quite a bit (almost 
all, really) of Drill code.

The thought is to create a new API that serves the needs of ODBC and JDBC, but 
without the complexity (while, of course, preserving the existing client for 
other uses). Said another way: find a way to keep the xDBC interfaces simple so 
that they don't take quite so much space in the client and don't require quite 
so much work to maintain.

The first issue (row vs. columnar) turns out not to be a huge issue: the 
columnar-to-row translation code exists and works. The real issue is allowing 
the client to control the size of the data sent from the server. (At present, 
the server decides the "batch" size, and sometimes the size is huge.) So, we 
can just focus on controlling batch size (and thus client buffer allocations), 
but retain the columnar form, even for ODBC and JDBC.

So, for the Pandas use case, does your code allow (or benefit from) multiple 
simultaneous queries over the same connection? Or, since Python seems to be 
only approximately multi-threaded, would a synchronous, columnar API work 
better? Here I just mean, in a single connection, is there a need to run 
multiple concurrent queries, or is the classic 
one-concurrent-query-per-connection model easier for Python to consume?
{quote}
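
The two translations discussed above (async-to-sync, and client-controlled 
batch size while staying columnar) can be sketched in Python. This is only an 
illustration under stated assumptions, not Drill's client API: server_batches, 
split_batch, and sync_batches are hypothetical names, and the dict-of-lists 
batch layout is invented for the example.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, Iterator, List, Optional

# Assumed layout: a batch maps each column name to a list of values.
Columns = Dict[str, List[Any]]


async def server_batches() -> AsyncIterator[Columns]:
    """Hypothetical async source in which the server picks the batch sizes."""
    yield {"id": [1, 2, 3, 4, 5], "name": ["a", "b", "c", "d", "e"]}
    yield {"id": [6, 7], "name": ["f", "g"]}


def split_batch(batch: Columns, max_rows: int) -> Iterator[Columns]:
    """Slice one columnar batch into columnar pieces of at most max_rows rows."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    row_count = len(next(iter(batch.values()), []))
    for start in range(0, row_count, max_rows):
        yield {name: col[start:start + max_rows] for name, col in batch.items()}


async def _next_or_none(source: AsyncIterator[Columns]) -> Optional[Columns]:
    """Await the next batch, or return None once the stream is exhausted."""
    try:
        return await source.__anext__()
    except StopAsyncIteration:
        return None


def sync_batches(source: AsyncIterator[Columns], max_rows: int) -> Iterator[Columns]:
    """Blocking iterator over an async batch stream, capped at max_rows rows."""
    loop = asyncio.new_event_loop()
    try:
        while True:
            batch = loop.run_until_complete(_next_or_none(source))
            if batch is None:
                break
            yield from split_batch(batch, max_rows)
    finally:
        # Close any async generators started on this private loop.
        loop.run_until_complete(loop.shutdown_asyncgens())
        loop.close()


# The caller sees a plain synchronous loop and never more than two rows at once.
received = list(sync_batches(server_batches(), max_rows=2))
sizes = [len(b["id"]) for b in received]
ids = [i for b in received for i in b["id"]]
```

In this sketch the cap is applied on the client after a full batch has 
arrived, so it bounds what the caller sees but not what is transferred. 
Limiting client buffer allocations, as described above, would require the 
server to honor the limit.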

> Provide a light-weight, versioned client API
> --------------------------------------------
>
>                 Key: DRILL-4791
>                 URL: https://issues.apache.org/jira/browse/DRILL-4791
>             Project: Apache Drill
>          Issue Type: New Feature
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 2.0.0
>
>
> Drill's existing client APIs are "industrial strength" - they provide full 
> access to the sophisticated distributed, columnar RPCs which Drill uses 
> internally. However, they are too complex for most client needs. Provide a 
> simpler API optimized for clients: row-based result sets, synchronous, etc.
> At the same time, Drill clients must currently link with the same version of 
> Drill code as is running on the Drill cluster. This forces clients to upgrade 
> in lock-step with the cluster. Allow Drill clients to be upgraded after (or 
> even before) the Drill cluster to simplify management of desktop apps that 
> use Drill.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
