[ https://issues.apache.org/jira/browse/DRILL-4791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051042#comment-16051042 ]
Paul Rogers commented on DRILL-4791:
------------------------------------

From Uwe Korn, on the dev mailing list:

{quote}
Bringing in a bit of perspective, partly as an Arrow developer but mostly as someone who works quite a lot in Python with the respective data libraries there:

In Python, all (performant) data crunching work is done on columnar representations. This is partly because columnar is more CPU-efficient for these tasks, but also because columnar can be abstracted in a form where you implement all computational work in C/C++ or an LLVM-based JIT while still keeping clear and understandable interfaces in Python. In the end, to provide efficient Python support, we will always have to convert into a columnar representation, which makes row-wise APIs to a system that is internally columnar quite annoying, as we have a lot of wastage in the conversion layer. If one wanted to support Python UDFs, this would lead to the situation that in most cases the UDF calls will be greatly dominated by the conversion logic.

For the actual performance differences that this makes, you can have a look at the recent work in Apache Spark, where Arrow is used for converting results from Spark's internal JVM data structures into typical Python ones ("Pandas DataFrames"). In comparison to the existing conversion, this currently sees a speedup of 40x, but it will be even higher once further steps are implemented. Julien should be able to provide a link to slides that outline the work better.

As I'm quite new to Drill, I cannot go into much further detail w.r.t. Drill, but be aware that for languages like Python, having a columnar API really matters. While Drill does not currently integrate with Python as a first-class citizen, moving to row-wise APIs probably won't make a difference to the current situation, but good columnar APIs would help us keep the path open for the future.
{quote}

Response from Paul:

{quote}
This is incredibly helpful information! Your explanation makes perfect sense.

We work quite a bit with ODBC and JDBC: two interfaces that are very much synchronous and row-based. There are three key challenges in working with Drill:

* Drill results are columnar, requiring a column-to-row translation for xDBC.
* Drill uses an asynchronous API, while JDBC and ODBC are synchronous, resulting in an async-to-sync API translation.
* The JDBC API is based on the Drill client, which requires quite a bit (almost all, really) of Drill code.

The thought is to create a new API that serves the needs of ODBC and JDBC, but without the complexity (while, of course, preserving the existing client for other uses). Said another way, find a way to keep the xDBC interfaces simple so that they don't take quite so much space in the client, and don't require quite so much work to maintain.

The first issue (row vs. columnar) turns out not to be a huge issue: the columnar-to-row translation code exists and works. The real issue is allowing the client to control the size of the data sent from the server. (At present, the server decides the "batch" size, and sometimes the size is huge.) So, we can just focus on controlling batch size (and thus client buffer allocations), but retain the columnar form, even for ODBC and JDBC.

So, for the Pandas use case, does your code allow (or benefit from) multiple simultaneous queries over the same connection? Or, since Python seems to be only approximately multi-threaded, would a synchronous, columnar API work better? Here I just mean, in a single connection, is there a need to run multiple concurrent queries, or is the classic one-concurrent-query-per-connection model easier for Python to consume?
{quote}

> Provide a light-weight, versioned client API
> --------------------------------------------
>
>                 Key: DRILL-4791
>                 URL: https://issues.apache.org/jira/browse/DRILL-4791
>             Project: Apache Drill
>          Issue Type: New Feature
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 2.0.0
>
>
> Drill's existing client APIs are "industrial strength" - they provide full
> access to the sophisticated distributed, columnar RPCs which Drill uses
> internally. However, they are too complex for most client needs. Provide a
> simpler API optimized for clients: row-based result sets, synchronous, etc.
> At the same time, Drill clients must currently link with the same version of
> Drill code as is running on the Drill cluster. This forces clients to upgrade
> in lock-step with the cluster. Allow Drill clients to be upgraded after (or
> even before) the Drill cluster to simplify management of desktop apps that
> use Drill.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
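
Editorial illustration (not part of the original comment): the two ideas in Paul's response, letting the client bound the batch size and translating columnar batches to rows for xDBC-style consumers, can be sketched in a few lines of Python. This is only a sketch under stated assumptions. `ColumnBatch`, `split_batch`, and `rows` are hypothetical names invented for this example and are not part of any Drill or Arrow API, and a plain dict of lists stands in for a real columnar batch.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical stand-in for one columnar batch: column name -> list of values.
# Not a Drill or Arrow type.
ColumnBatch = Dict[str, List[object]]


def split_batch(batch: ColumnBatch, max_rows: int) -> Iterator[ColumnBatch]:
    """Slice one large columnar batch into batches of at most max_rows rows.

    This models the client choosing the batch size instead of the server,
    while the data stays in columnar form.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # All columns are assumed to have the same length.
    row_count = len(next(iter(batch.values()), []))
    for start in range(0, row_count, max_rows):
        yield {
            name: values[start:start + max_rows]
            for name, values in batch.items()
        }


def rows(batch: ColumnBatch) -> Iterator[Tuple[object, ...]]:
    """Column-to-row translation: transpose a columnar batch into row tuples."""
    return zip(*batch.values())


# One "huge" server batch, consumed in client-sized pieces.
server_batch = {"id": [1, 2, 3, 4, 5], "name": ["a", "b", "c", "d", "e"]}
small_batches = list(split_batch(server_batch, max_rows=2))
all_rows = [row for b in small_batches for row in rows(b)]
```

A columnar consumer such as Pandas would stop at `split_batch` and use the columns directly. Only a row-based interface needs the extra `rows` step, which is the conversion overhead Uwe describes.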