[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

GitBox Tue, 04 Jan 2022 04:04:26 -0800


jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004751317



   Paul Rogers wrote:
   
   OK, so the above raise the issues we have to consider when thinking about 
vectors (Drill's or Arrow's.) What is the alternative?
   
   Here, I think Impala got it right. Impala uses Parquet (columnar) for 
*storage*, but rows for *internal* processing. Impala is like an Italian sports 
car of old: spends lots of time in the shop, but when it works, it is very fast.
   
   One of the reasons that Impala is fast is because of the row format.  First, 
let's describe what "row-based" means. It means that columns appear together, 
as in a C `struct`, with rows packed one after another as in an array of 
`structs`. This means that the data for a given row is contiguous. There is 
only one buffer to size. Classic DB stuff that seems unusual only because we're 
all used to Drill's vector format.
   
   Let's look at the same issues above, but from a row-based perspective.
   
   **Expression Execution**: With a row-based model, the CPU can easily load a 
single row into the L1 cache. All our crufty-real-world expression logic works 
on that single row. So, no matter how messy the expressions, from the CPU's 
perspective, all the data is in that single row, which fits nicely into the 
cache.
   
   Rows can be small (a few dozen bytes) or large (maybe a 10s of K for long 
VARCHARs). In either case, they are far smaller than the L1 cache. The row is 
loaded. Once we move onto the next row, we'll never visit the previous one, so 
we don't care if the CPU flushes it from the cache.
   
   **Memory Allocation**: Rows reside in buffers (AKA "pages"), typically of a 
fixed size. A reader "pours" data into a row. When the page is full, that last 
record is copied to the next page. Only that one row is copied, not all the 
existing data. So, we eliminate the 1X copy + 1X load problem in Drill. Since 
there is only one page to allocate, memory is simpler. Since pages are of fixed 
size, memory management is simpler as well.
   
   **Exchanges**: Network exchanges are row-based. Rows are self-contained. A 
network sender can send single rows, if that is efficient, or batches of rows. 
In our 100-senders-100-receiver example, we could send rows as soon as they are 
available. The receiver starts working as soon as the first row is available. 
There is no near-deadlock from excessive buffering.
   
   Yes, we would want to buffer rows (into pages) for efficiency. But, in 
extreme cases, we can send small numbers of rows to keep the DAG flowing.
   
   **Clients**: As noted above, row-based clients are the classic solution and 
are simple to write. We could easily support proper clients in Python, Go, Rust 
and anything else if we used a row-based format.
   
   **Conclusion**: We tend to focus on the "value vector vs. Arrow" discussion. 
I'm here to say that that is the wrong question: it buys into myths which have 
hurt Drill for years. The *correct* question is: what is the most efficient 
format for the use cases where Drill wants to excel? The above suggests that, 
rather than Arrow, a better solution is to adopt a row-based internal format.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [drill] jnturton commented on issue #2421: ValueVectors replacement

Reply via email to