Hi All, As you've seen, I've been suggesting that we consider multiple choices for our internal data representation beyond just the current value vector layout and the "obvious" Arrow layout. And, that we consider out options based on where we see Drill adding value in the open source community.
Frankly, I'm not convinced that Arrow is the best solution. Arrow evolved from value vectors, by the same people who built value vectors. No doubt that Arrow is a more advanced, faster, higher-quality version of value vectors -- the Arrow team has put lots of effort into Arrow while Drill has more-or-less kept Value Vectors the same. So, we can agree that Arrow is likely better than value vectors. But, is a direct-memory, columnar solution optimal for a tool like Drill? This is the real question we should ask. Why do we have value vectors and Arrow? * Industry experience shows that a columnar layout is more performant for some kinds of workloads. Parquet makes good use of the columnar format for data compression and to reduce I/O for unwanted columns. * Netty uses async I/O which requires data to be in direct memory. Value vectors and Arrow thus store data in direct memory to avoid copies on network sends and receives. * Columnar layouts should be able to exploit CPU vector instructions. However, those instructions don't work with nullable data, so some clever work is needed. And, as noted earlier, Drill actually generates row-wise code, not column-wise. * The Java GC is a known evil (or was in Java 5 when Drill started.) The Drill developers felt they could do a better job of memory allocation than Java could; hence our VERY complex memory allocator, buffers, ledgers and all the rest. * Java has no good way to efficiently lay out a row in memory as a generic data structure. Most row-based tools in Java use boxed objects for columns, an object array as the row. This seems an expensive solution. The case for vectors (Value or Arrow) seems compelling. But, our job, as Drill leads, is to question and validate each of these assertions. The above are true, but they may not be the WHOLE truth. Columnar-vs-row: As it turns out, Impala, Presto and Spark are all row based. Sadly, all three of these tools have more adoption than Drill. So, a row-based layout, per-se, is not necessarily a bad thing. I saw a paper many years ago (which, alas, I can no longer find) that showed the result of experiments of columnar vs. row-based memory layouts. The gist was that, for numeric (analytic) queries, with a small number of columns, columnar performed better. Beyond some number of columns, row-based seemed to be better. Rather than continue to pick at each point one by one (this e-mail is already too long), let's just offer an example. Suppose we observe that 1) clients want a row-based data format, 2) most readers offer row-based data, 3) Drill uses Java code gen that processes data row-wise, 4) Java does have a very efficient row data structure: the class, 5) Java offers a built-in, fragmentation-free memory allocator: the GC system. So, suppose we just generate a Java class that uses fields to hold data for columns. Now we can test the assertions about direct-memory columnar being faster. There is a class in Drill, PerformanceTool, which was used to compare column accessor performance vs. direct vector access. Let's re-purpose it to compare the column accessors (on value vectors) vs. simulated generated code that works with rows as Java classes. The test runs 300 batches of 4 million rows per batch (total of 1.23G rows). For each batch, it writes a single int to each row, finalizes the batch, then reads the int back from each row. That is, it simulates a very simple reader followed by a filter. First, we do this using vector accessors for a required and nullable int and time the results (on an ancient Core i7 CPU, Ubuntu Linux): Required accessor: 10,434 ms (121M rows/sec) Nullable accessor: 16,605 ms (76M rows/sec) The above makes sense: a nullable vector has a second "bits" vector that we must fiddle with. The numbers include all the work to allocate vectors, much with ledgers and allocators, write to direct memory, read from direct memory and all the rest. Now, let's do the same thing with Java. We "generate" a class to hold the int (and for nullable, an "isSet" field). We use a Java array to hold the rows. Otherwise, the write-then-read pattern is the same. Now we get: Required object: 5100 ms (247M rows/sec, 2x faster) Nullable object: 6656 ms (189M rows/sec, 2.5x faster) WTF? We get twice the speed WITHOUT any of our fancy direct memory, DrillBuf, allocator, ledger, vector, accessor, code gen work? Could it be true that the entire Java industry, working to perfect Java GC, has done a better job than the original Drill developers did with our memory allocator? One wonders how much better Arrow might be. Imagine how much simpler code gen will be if we work only with Java objects rather than with layers of holders, readers, vectors and buffers. Imagine how much simpler Drill would be if we scrap the allocators, ledgers and the rest. (And, yes, we would not even need much of the guts of the column accessors.) Imagine how much simpler UDFs will be if they just work with Java. Imagine how much simpler format and plugins could be. All of this would help Drill be the best query tool to integrate with custom applications, data sources and logic. Seems too easy. We must be overlooking something. For example, we have not measured the cost of network I/O. But, remember that our exchanges are almost always hash or partition based. In a large cluster (100 nodes) with a beefy CPU (24 cores), each fragment must buffer an outgoing batch for each target fragment: that is 2500 buffered batches (at 10-100 MB each) plus all the shuffling to move rows into the correct partition. Very little of that would be needed for rows based on Java objects. It could be that all our copying and buffering takes more resources than the copy-free approach saves. Could be that Java objects would transfer faster, with less memory overhead. Still, can't be right, can it? IIRC, this is exactly the approach that Spark took before Data Frames, and Spark seemed to work out alright. Of course, the above tests only about 1% of Drill, and so is far from conclusive. Are Java objects the solution? I don't know, any more than I know that fixed-width blocks are the solution. I'm just asking the questions. The point is, we have to research and test to find what is best for the goals we identify for Drill. Thanks, - Paul