Hi All,

As you've seen, I've been suggesting that we consider multiple choices for our 
internal data representation beyond just the current value vector layout and 
the "obvious" Arrow layout, and that we weigh our options based on where we 
see Drill adding value in the open source community.


Frankly, I'm not convinced that Arrow is the best solution. Arrow evolved 
from value vectors, built by the same people who created value vectors. No 
doubt Arrow is a more advanced, faster, higher-quality version of value 
vectors -- the Arrow team has put lots of effort into Arrow, while Drill has 
more or less kept value vectors the same.

So, we can agree that Arrow is likely better than value vectors. But, is a 
direct-memory, columnar solution optimal for a tool like Drill? This is the 
real question we should ask.

Why do we have value vectors and Arrow?

* Industry experience shows that a columnar layout is more performant for some 
kinds of workloads.  Parquet makes good use of the columnar format for data 
compression and to reduce I/O for unwanted columns.


* Netty uses async I/O which requires data to be in direct memory. Value 
vectors and Arrow thus store data in direct memory to avoid copies on network 
sends and receives.

* Columnar layouts should be able to exploit CPU vector instructions. However, 
those instructions don't work with nullable data, so some clever work is 
needed. And, as noted earlier, Drill actually generates row-wise code, not 
column-wise.

* The Java GC is a known evil (or was in Java 5, when Drill started). The 
Drill developers felt they could do a better job of memory allocation than 
Java could; hence our VERY complex memory allocator, buffers, ledgers and all 
the rest.

* Java has no good way to efficiently lay out a row in memory as a generic 
data structure. Most row-based tools in Java use boxed objects for the 
columns and an object array as the row, as sketched below. This seems an 
expensive solution.
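
For illustration (a generic pattern, not code from any particular engine):

    // A generic boxed row: every column value is an Object. A single int
    // costs an Integer (object header + value) plus an Object[] slot, and
    // every read pays an unboxing cast.
    Object[] row = new Object[] { Integer.valueOf(42), "fred", null };
    int id = (Integer) row[0];         // unbox on every read
    boolean isNull = (row[2] == null); // nulls are free, but untyped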


The case for vectors (Value or Arrow) seems compelling. But, our job, as Drill 
leads, is to question and validate each of these assertions. The above are 
true, but they may not be the WHOLE truth.


Columnar-vs-row: As it turns out, Impala, Presto and Spark are all row based. 
Sadly, all three of these tools have more adoption than Drill. So, a 
row-based layout, per se, is not necessarily a bad thing.


I saw a paper many years ago (which, alas, I can no longer find) that showed 
the results of experiments comparing columnar and row-based memory layouts. 
The gist was that, for numeric (analytic) queries with a small number of 
columns, columnar performed better. Beyond some number of columns, row-based 
seemed to be better.

Rather than continue to pick at each point one by one (this e-mail is already 
too long), let's just offer an example.

Suppose we observe that 1) clients want a row-based data format, 2) most 
readers offer row-based data, 3) Drill uses Java code gen that processes data 
row-wise, 4) Java does have a very efficient row data structure: the class, 5) 
Java offers a built-in, fragmentation-free memory allocator: the GC system.


So, suppose we just generate a Java class that uses fields to hold data for 
columns. Now we can test the assertions about direct-memory columnar being 
faster. There is a class in Drill, PerformanceTool, which was used to compare 
column accessor performance vs. direct vector access. Let's re-purpose it to 
compare the column accessors (on value vectors) vs. simulated generated code 
that works with rows as Java classes.
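
To make "generate a Java class" concrete, here is roughly the kind of class 
such code gen would emit (a sketch; the names are illustrative, not the 
actual test code):

    // One class per row schema, one Java field per column. For the
    // nullable case, a boolean field stands in for the "bits" vector.
    public class RequiredIntRow {
      public int value;
    }

    public class NullableIntRow {
      public int value;
      public boolean isSet;
    }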

The test runs 300 batches of 4 million rows per batch (total of 1.23G rows). 
For each batch, it writes a single int to each row, finalizes the batch, then 
reads the int back from each row. That is, it simulates a very simple reader 
followed by a filter.
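
In outline, both implementations run the same harness (a sketch of the shape 
only; the real code is in PerformanceTool, and the helper names here are 
hypothetical):

    static final int BATCHES = 300;
    static final int ROWS_PER_BATCH = 4_000_000; // "4 million rows"

    for (int b = 0; b < BATCHES; b++) {
      writeBatch();    // write a single int to each row
      finalizeBatch(); // set the value count / seal the batch
      readBatch();     // read the int back from every row
    }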

First, we do this using vector accessors for a required and nullable int and 
time the results (on an ancient Core i7 CPU, Ubuntu Linux):

Required accessor: 10,434 ms (121M rows/sec)

Nullable accessor: 16,605 ms (76M rows/sec)


The above makes sense: a nullable vector has a second "bits" vector that we 
must fiddle with. The numbers include all the work to allocate vectors, muck 
with ledgers and allocators, write to direct memory, read from direct memory 
and all the rest.
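
For reference, the vector path boils down to loops like this (a sketch 
against Drill's raw IntVector API for the required case; the actual test 
drives it through the column accessors, which add another layer):

    IntVector vector = new IntVector(
        MaterializedField.create("a", Types.required(MinorType.INT)),
        allocator);                          // allocator, ledgers and all
    vector.allocateNew(ROWS_PER_BATCH);      // grab a direct-memory buffer
    IntVector.Mutator mutator = vector.getMutator();
    for (int i = 0; i < ROWS_PER_BATCH; i++) {
      mutator.setSafe(i, i);                 // bounds check, maybe realloc
    }
    mutator.setValueCount(ROWS_PER_BATCH);   // finalize the batch
    IntVector.Accessor accessor = vector.getAccessor();
    long sum = 0;
    for (int i = 0; i < ROWS_PER_BATCH; i++) {
      sum += accessor.get(i);                // read back from direct memory
    }
    vector.clear();                          // release the direct memory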

Now, let's do the same thing with plain Java. We "generate" a class to hold 
the int (and, for nullable, an "isSet" field). We use a Java array to hold 
the rows. Otherwise, the write-then-read pattern is the same.
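
Using the row classes sketched above, the per-batch loops become (again, a 
sketch, not the actual test code):

    NullableIntRow[] batch = new NullableIntRow[ROWS_PER_BATCH];
    for (int i = 0; i < ROWS_PER_BATCH; i++) {  // the "reader"
      NullableIntRow row = new NullableIntRow();
      row.value = i;
      row.isSet = true;
      batch[i] = row;
    }
    long sum = 0;
    for (int i = 0; i < ROWS_PER_BATCH; i++) {  // the "filter"
      if (batch[i].isSet) {
        sum += batch[i].value;
      }
    }
    // No allocator, no ledger, no DrillBuf: the whole batch simply
    // becomes garbage and the GC reclaims it wholesale.

Now we get: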

Required object: 5,100 ms (247M rows/sec, 2x faster)

Nullable object: 6,656 ms (189M rows/sec, 2.5x faster)

WTF? We get twice the speed WITHOUT any of our fancy direct memory, DrillBuf, 
allocator, ledger, vector, accessor, code gen work? Could it be true that the 
entire Java industry, working to perfect Java GC, has done a better job than 
the original Drill developers did with our memory allocator? One wonders how 
much better Arrow might be.


Imagine how much simpler code gen will be if we work only with Java objects 
rather than with layers of holders, readers, vectors and buffers. Imagine how 
much simpler Drill would be if we scrap the allocators, ledgers and the rest. 
(And, yes, we would not even need much of the guts of the column accessors.)


Imagine how much simpler UDFs will be if they just work with Java. Imagine how 
much simpler format and plugins could be. All of this would help Drill be the 
best query tool to integrate with custom applications, data sources and logic.
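
To make the UDF point concrete: today, even a trivial function must be 
written against holders and the code-gen conventions. This is the current, 
real API:

    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.IntHolder;

    @FunctionTemplate(name = "add_one",
        scope = FunctionTemplate.FunctionScope.SIMPLE,
        nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
    public class AddOne implements DrillSimpleFunc {
      @Param IntHolder in;
      @Output IntHolder out;
      public void setup() { }
      public void eval() { out.value = in.value + 1; }
    }

With plain Java rows, one could imagine it collapsing to an ordinary method 
(purely hypothetical; no such API exists today):

    public static int addOne(int value) {
      return value + 1;
    }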


Seems too easy. We must be overlooking something. For example, we have not 
measured the cost of network I/O. But, remember that our exchanges are 
almost always hash or partition based. In a large cluster (100 nodes) with 
beefy CPUs (24 cores each), each sending fragment must buffer an outgoing 
batch for each target fragment: 100 x 24 = 2,400, call it 2,500 buffered 
batches (at 10-100 MB each), plus all the shuffling to move rows into the 
correct partition. Very little of that would be needed for rows based on 
Java objects. It could be that all our copying and buffering takes more 
resources than the copy-free approach saves. Could be that Java objects 
would transfer faster, with less memory overhead.

Still, it can't be right, can it? IIRC, this is exactly the approach that 
Spark took before DataFrames, and Spark seemed to work out alright. Of 
course, the above exercises only about 1% of Drill, and so is far from 
conclusive.


Are Java objects the solution? I don't know, any more than I know that 
fixed-width blocks are the solution. I'm just asking the questions. The point 
is, we have to research and test to find what is best for the goals we identify 
for Drill.


Thanks,
- Paul

