jnturton commented on issue #2421: URL: https://github.com/apache/drill/issues/2421#issuecomment-1004751317
Paul Rogers wrote: OK, so the above raise the issues we have to consider when thinking about vectors (Drill's or Arrow's.) What is the alternative? Here, I think Impala got it right. Impala uses Parquet (columnar) for *storage*, but rows for *internal* processing. Impala is like an Italian sports car of old: spends lots of time in the shop, but when it works, it is very fast. One of the reasons that Impala is fast is because of the row format. First, let's describe what "row-based" means. It means that columns appear together, as in a C `struct`, with rows packed one after another as in an array of `structs`. This means that the data for a given row is contiguous. There is only one buffer to size. Classic DB stuff that seems unusual only because we're all used to Drill's vector format. Let's look at the same issues above, but from a row-based perspective. **Expression Execution**: With a row-based model, the CPU can easily load a single row into the L1 cache. All our crufty-real-world expression logic works on that single row. So, no matter how messy the expressions, from the CPU's perspective, all the data is in that single row, which fits nicely into the cache. Rows can be small (a few dozen bytes) or large (maybe a 10s of K for long VARCHARs). In either case, they are far smaller than the L1 cache. The row is loaded. Once we move onto the next row, we'll never visit the previous one, so we don't care if the CPU flushes it from the cache. **Memory Allocation**: Rows reside in buffers (AKA "pages"), typically of a fixed size. A reader "pours" data into a row. When the page is full, that last record is copied to the next page. Only that one row is copied, not all the existing data. So, we eliminate the 1X copy + 1X load problem in Drill. Since there is only one page to allocate, memory is simpler. Since pages are of fixed size, memory management is simpler as well. **Exchanges**: Network exchanges are row-based. Rows are self-contained. A network sender can send single rows, if that is efficient, or batches of rows. In our 100-senders-100-receiver example, we could send rows as soon as they are available. The receiver starts working as soon as the first row is available. There is no near-deadlock from excessive buffering. Yes, we would want to buffer rows (into pages) for efficiency. But, in extreme cases, we can send small numbers of rows to keep the DAG flowing. **Clients**: As noted above, row-based clients are the classic solution and are simple to write. We could easily support proper clients in Python, Go, Rust and anything else if we used a row-based format. **Conclusion**: We tend to focus on the "value vector vs. Arrow" discussion. I'm here to say that that is the wrong question: it buys into myths which have hurt Drill for years. The *correct* question is: what is the most efficient format for the use cases where Drill wants to excel? The above suggests that, rather than Arrow, a better solution is to adopt a row-based internal format. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org