jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004750707


   Paul Rogers wrote:
   
   All: so I've kicked the hornet's nest with the mention of value vectors and 
Arrow. I'm going to put on my flame-proof suit and debunk some myths.
   
   The columnar format is great for storage, for all the usual reasons. This is 
why Parquet uses it, Druid uses it for segment files, and various DBs use it 
for storage. The question we want to ask is, do those benefits apply to the 
format within the Drill execution engine? I'm here to suggest that columnar has 
no advantage, and many disadvantages, when used as the *internal* format of an 
execution engine. "Thems is fighting words", so let's bring it on.
   
   I've had the pleasure of working with several query engines: Drill 
(columnar) and Impala (row-based) are two well-known examples. This has given 
me a unique opportunity to see if all the marketing claims for columnar (which 
still appear in the videos on Drill's website) actually hold up in practice. 
Spoiler: they don't.
   
   This is a PR about optimization. A good rule in optimization is to start 
with the biggest issues, then work toward the details. So, rather than tinker 
with the details of vector execution, let's look at the fundamental issues. I 
hope this will help us avoid confusing Drill's (and Arrow's) marketing with 
reality.
   
   **Myth: Vectorized execution**: The biggest myth is around vectorized 
execution. Of course, a minor myth is that Drill uses such execution (it 
doesn't.) The bigger myth is that, if we invested enough, it could.
   
   Vectorized execution is great when we have a simple operation we apply to a 
large amount of data. Think the dot-product operation for neural networks, or 
data compression, or image transforms, or graphics. In all cases, we apply a 
simple operation (rescale, say) to a large amount of homogeneous data (the 
pixels in an image.)
   
   So, the question is, does typical, real-world SQL fit this pattern? I've now 
seen enough crufty, complex, messy real-world queries to suggest that, no, SQL 
is not a good candidate for vectorization. `SELECT` and `WHERE` clauses embed 
business logic, and that logic is based on messy human rules, not crisp, clean 
mathematics. The resulting SQL tends to have conditionals (`WHEN` or `IF()`, 
etc.), lots of function calls (all those cool UDFs which @cgivre has written), 
and so on. Plus, as noted above, SQL deals with NULL values, which must 
short-circuit entire execution paths.
   
   Hence, even if we could vectorize simple operations, we'd find that, in most 
queries, we could not actually use that code.
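
   To make the argument concrete, here is a minimal, hypothetical sketch (plain Java, not Drill's actual generated code). The first loop is the kind of branch-free, homogeneous arithmetic a compiler or JIT can often turn into SIMD instructions; the second mimics a realistic `SELECT` expression with a NULL check, a conditional, and a UDF call (`discountFor` is made up), which defeats that optimization.

```java
// Hypothetical illustration: what auto-vectorizes vs. what doesn't.
public class VectorizationSketch {

    // Branch-free, homogeneous arithmetic: a compiler/JIT can often emit SIMD for this.
    static void simpleKernel(double[] sales, double[] taxes, double[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = sales[i] + taxes[i];
        }
    }

    // Realistic SQL-ish logic: NULL handling, a conditional, and a UDF call.
    // Each branch and call breaks the straight-line pattern SIMD needs.
    static void messyKernel(double[] sales, boolean[] salesIsNull, String[] region,
                            double[] out, boolean[] outIsNull) {
        for (int i = 0; i < out.length; i++) {
            if (salesIsNull[i]) {                        // NULL short-circuits the expression
                outIsNull[i] = true;
                continue;
            }
            double discount = discountFor(region[i]);    // stand-in for an arbitrary UDF
            out[i] = "EU".equals(region[i]) ? sales[i] * 1.2 - discount
                                            : sales[i] - discount;
            outIsNull[i] = false;
        }
    }

    // Made-up business rule standing in for a UDF.
    static double discountFor(String region) {
        return "EU".equals(region) ? 2.0 : 0.0;
    }
}
```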
   
   **Myth: Vectors are CPU Cache Friendly**: The second big myth is that 
vectors are somehow more "friendly" to the CPU L1 cache than a row format. The 
idea is that one can load a vector into the L1 cache, then zip through many 
values in one go. This myth is related to the above one.
   
   First, SQL expressions are not based on columns; they are based on rows. 
Each calculation tends to involve multiple columns: `net_receipts = sales + 
taxes - returns`. Here each calculation touches four vectors, so we need all 
four to be in the CPU cache to benefit.
   
   Second, SQL is row based: the above calculation is just one of perhaps many 
that occur on each row. In the ideal case, the calculations fall into independent 
groups: `SELECT a + b AS x, c - d + e AS y, f / g AS z, ...`. In that case, we 
could load vectors `a`, `b`, `x` into the L1 cache, do the calcs, then load 
`c`, `d`, `e` and `y` into the cache, and so on. Of course, Drill doesn't work this 
way (it does all the calculations for a single row before moving to the next), 
but it could, and it would have to in order to benefit from vectorization.
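
   A minimal sketch of the two loop orderings, using made-up column arrays: the first is the row-at-a-time order Drill actually uses; the second is the expression-group-at-a-time order that the L1-cache story assumes.

```java
// Two evaluation orders for: x = a + b,  y = c - d + e  (columns as plain arrays).
public class EvaluationOrderSketch {

    // What Drill does today: evaluate every expression for row i, then move on.
    // The working set of each iteration spans all of a, b, c, d, e, x, y.
    static void rowAtATime(double[] a, double[] b, double[] c, double[] d,
                           double[] e, double[] x, double[] y) {
        for (int i = 0; i < a.length; i++) {
            x[i] = a[i] + b[i];
            y[i] = c[i] - d[i] + e[i];
        }
    }

    // What the cache-friendly story assumes: one expression group per pass, so only
    // a, b, x (and then c, d, e, y) need to be cache-resident at any one time.
    static void groupAtATime(double[] a, double[] b, double[] c, double[] d,
                             double[] e, double[] x, double[] y) {
        for (int i = 0; i < a.length; i++) {
            x[i] = a[i] + b[i];
        }
        for (int i = 0; i < c.length; i++) {
            y[i] = c[i] - d[i] + e[i];
        }
    }
}
```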
   
   A more typical case is that the same column is used in multiple expressions: 
`SELECT a + b AS x, a / c AS y, (a - d) * e AS z, ...` In this case, we must 
load the `a` vector into the L1 cache multiple times. (Or, more properly, its 
values would continually be bumped out of the cache, then reloaded.)
   
   **Myth: Bigger Vectors are Better**: Drill went through a phase when everyone 
bought into the "L1 cache" myth. To get better performance everyone wanted ever 
larger vectors. In the code, you'll see that we started with 1K-row batches, 
then it grew to 4K, then other code would create 64K row batches. It got so bad 
we'd allocate vectors larger than 16MB, which caused memory fragmentation and 
OOM errors. (This is the original reason for what evolved to be "EVF": to 
control vector sizes to prevent memory fragmentation - very basic DB stuff.)
   
   Remember, the CPU L1 cache is only about 256K in size. A 4MB vector is 
already 16x the L1 cache size. Combine that with real-world expressions and we 
end up with a "working set" of 10s of MB in size: 20x or more the L1 cache 
size. The result is lots of cache misses. (This stuff is really hard to 
measure; it would be great for someone to run experiments that show this 
happening in practice.)
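
   Back-of-the-envelope arithmetic using the figures quoted above (illustrative numbers, not measurements):

```java
// Rough cache-pressure arithmetic with the sizes cited in this discussion.
public class CacheArithmetic {
    public static void main(String[] args) {
        long l1Bytes     = 256L * 1024;        // the "about 256K" cache figure above
        long vectorBytes = 4L * 1024 * 1024;   // one 4MB vector
        int  vectorsUsed = 4;                  // e.g. sales, taxes, returns, net_receipts

        System.out.println("One vector vs. cache:  " + (vectorBytes / l1Bytes) + "x");
        System.out.println("Expression vs. cache:  " + (vectorsUsed * vectorBytes / l1Bytes) + "x");
    }
}
```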
   
   **Myth: Vectors are Efficient**: A related, minor myth is that writing to 
vectors is more efficient than writing to rows. This is not true in general. In 
Drill, it is especially false. Originally, vectors were allocated at some small 
initial size (256K? Something like that.) Vectors grow as we add data. (That's 
one of the things that make working with them difficult.) Fill the 256K vector? 
We double it to 512K and copy across the data. We do it again at 1MB, 2MB, 4MB, 
... In the end, to grow to 4MB, we copy about 4MB of data. That is 4MB of reads 
and 4MB of writes, in addition to the 4MB of writes needed to create the data.
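
   A quick simulation of that doubling behaviour, using the approximate sizes mentioned above:

```java
// Simulate a vector growing by doubling from 256K to 4MB and count the bytes copied.
public class VectorGrowthSketch {
    public static void main(String[] args) {
        long size = 256 * 1024;            // initial allocation
        long target = 4L * 1024 * 1024;    // final vector size
        long bytesCopied = 0;

        while (size < target) {
            bytesCopied += size;           // each doubling copies the current contents
            size *= 2;
        }
        // Prints roughly 3.75MB copied, on top of the 4MB of writes that created the data.
        System.out.println("Copied " + bytesCopied / (1024.0 * 1024.0) + " MB to reach "
                + size / (1024 * 1024) + " MB");
    }
}
```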
   
   Later, a bunch of ad-hoc "batch sizing" code was added to try to guess a 
good initial size for vectors. That's not too hard for fixed-width vectors 
(`INT`, say), but it is rather tricky for variable-width vectors (`VARCHAR`).
   
   Remember that each vector is sized and grows independently. So, to create a 
batch, we have to track the size of every vector and grow those that need 
growing, without over-allocating all the vectors, because the space is not 
fungible: vector `a` can't "borrow" unused space from vector `b`.
   
   The point is, Drill needs a large amount of complex, add-on code just to 
work around the fact that every vector will grow, and copy data, if not sized 
correctly, and, in general, we don't know ahead of time what the size should 
be. The result is inefficiency.
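
   A simplified sketch of the per-vector accounting this implies. The `GrowableVector` and `BatchSizer` types below are hypothetical, not Drill's actual classes:

```java
import java.util.List;

// Hypothetical per-vector accounting: each column tracks and grows its own buffer.
class GrowableVector {
    byte[] buffer = new byte[256 * 1024];
    int writeOffset = 0;

    void ensureCapacity(int bytesNeeded) {
        // Space in one vector cannot be "borrowed" by another; each grows on its own.
        while (writeOffset + bytesNeeded > buffer.length) {
            byte[] bigger = new byte[buffer.length * 2];
            System.arraycopy(buffer, 0, bigger, 0, writeOffset);   // the copy cost described above
            buffer = bigger;
        }
    }
}

class BatchSizer {
    // To bound the batch size, every vector must be inspected individually.
    static long batchSizeBytes(List<GrowableVector> vectors) {
        long total = 0;
        for (GrowableVector v : vectors) {
            total += v.buffer.length;
        }
        return total;
    }
}
```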
   
   **Myth: Vectors Allow Efficient Parquet Integration**: The idea here is that 
Parquet is columnar, vectors are columnar, so we just read Parquet directly 
into vectors. The reality is that Parquet is a highly-encoded format that 
requires a large amount of complex code to decode. Drill does the decoding 
value-by-value; there is no "direct copy", nor could there be. Parquet works 
with Drill's value vectors, Arrow vectors, or Impala rows equally well: in each 
case, the Parquet data has to be decoded value-by-value, then written to the 
target format.
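
   A hedged sketch of what "value-by-value" means here. The `ParquetColumnDecoder` and `ColumnWriter` interfaces are hypothetical stand-ins, not the real parquet-mr or Drill APIs; the point is only that each value is decoded and re-written individually, whatever the target layout.

```java
// Hypothetical stand-ins for a Parquet column decoder and a target column writer.
interface ParquetColumnDecoder {
    boolean hasNext();
    boolean nextIsNull();
    long nextLong();       // decodes RLE/dictionary/bit-packed data into a plain value
}

interface ColumnWriter {   // could back a value vector, an Arrow vector, or a row buffer
    void writeNull();
    void writeLong(long value);
}

class ParquetCopySketch {
    // No bulk memcpy is possible: every value is decoded, then written to the target.
    static void copyColumn(ParquetColumnDecoder in, ColumnWriter out) {
        while (in.hasNext()) {
            if (in.nextIsNull()) {
                out.writeNull();
            } else {
                out.writeLong(in.nextLong());
            }
        }
    }
}
```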
   
   **Single Data Format for In-Memory and Over-the-Wire**: One of Drill's 
claims to fame is that value vectors (and, later, Arrow vectors) use the same 
layout in memory as over the wire, leading to efficient exchanges via RPC. This 
myth is true, as far as it goes. But, if you look at the code, the truth is 
much more complex. On the sender side, vectors are independent buffers (as 
explained above.) On the receiver side, the whole message comes in as one big 
buffer. Special code slices up that buffer to recover vectors. A vast amount of 
complex code is needed to handle the accounting. (Thanks to Chris, a one-time 
Drill contributor who wrote all that code well enough that it "just works". It 
would be hell to fix otherwise.)
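
   A much-simplified sketch of the receiver-side slicing described above, using plain `java.nio.ByteBuffer` rather than Drill's Netty-based buffers, and assuming the per-vector lengths arrive in the batch metadata:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Receiver side: one big RPC buffer is sliced into per-vector views.
class IncomingBatchSketch {
    // vectorLengths would come from the batch metadata; here it is just a parameter.
    static List<ByteBuffer> sliceIntoVectors(ByteBuffer wireBuffer, int[] vectorLengths) {
        List<ByteBuffer> vectors = new ArrayList<>();
        int offset = 0;
        for (int length : vectorLengths) {
            ByteBuffer view = wireBuffer.duplicate();
            view.position(offset).limit(offset + length);
            vectors.add(view.slice());    // a zero-copy view over the big buffer
            offset += length;
        }
        return vectors;
    }
}
```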
   
   **Complex Client Code**: All DB clients work a row at a time. Whether it is 
JDBC, SQLAlchemy, your home-grown code, or whatever, a client consumes data a 
row at a time. All DB APIs (except Drill's) are row-based. The client asks for 
the next row (or, more typically, the next *n* rows), then reads them 
one by one. Super simple, and it has worked for decades.
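
   For contrast, here is the decades-old row-at-a-time pattern in standard JDBC (the connection string, query, and column names are just illustrative):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// The classic row-based client loop: ask for the next row, read its columns.
public class RowClientSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT name, sales FROM dfs.`sales.csv`")) {
            while (rs.next()) {                      // one row at a time
                String name = rs.getString("name");
                double sales = rs.getDouble("sales");
                System.out.println(name + ": " + sales);
            }
        }
    }
}
```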
   
   With Drill, a client has to include all of Drill's complex vector logic so 
it can read a batch of 1K (or 4K or more) rows split across vectors. The 
client then has to walk the vectors to assemble the row that the client really 
wants. The result is that Java clients (such as JDBC) pull in a vast amount of 
Drill code. The C++ ODBC client was a nightmare: pretty much only one person 
ever could make it work (thanks, Parth!).
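
   By contrast, a Drill-style client has to do something like the sketch below to get a row back. The `VectorAccessor` interface is a hypothetical stand-in for the value-vector classes the real driver pulls in:

```java
import java.util.List;

// Hypothetical stand-in for the per-column vector accessors a Drill client must use.
interface VectorAccessor {
    boolean isNull(int rowIndex);
    Object get(int rowIndex);
}

class VectorWalkingClientSketch {
    // To produce one row, the client must visit every vector in the batch.
    static Object[] assembleRow(List<VectorAccessor> vectors, int rowIndex) {
        Object[] row = new Object[vectors.size()];
        for (int col = 0; col < vectors.size(); col++) {
            VectorAccessor v = vectors.get(col);
            row[col] = v.isNull(rowIndex) ? null : v.get(rowIndex);
        }
        return row;
    }
}
```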
   
   The complexity of this simple client operation has led clients to use the 
REST API instead. But Drill is session-based (a session is needed to set Drill's 
zillions of session options), while REST is not. Drill works for "big data", but 
REST delivers the results in one big blob. The result is that the vector-based 
client is so hard to use that folks want to use the REST client, which doesn't 
work in the general case either. We've shot ourselves in the foot. Yet every 
other DB on the planet has a simple row-based client API.

