One of the main problems that a columnar internal batch format solves is interpretation overhead - if you switch based on type/expression/etc. once per value, it is quite slow, but if you switch once per vector of values, the overhead is amortized across the whole vector.
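As a rough illustration (not Impala code, just a hypothetical typed-sum kernel), compare paying the type switch on every value with switching once per batch and then running a tight per-type loop:

    #include <cstdint>
    #include <cstddef>

    enum class Type { kInt64, kDouble };

    // Interpreted per value: the switch (and its branch cost) is paid
    // once for every element.
    double SumPerValue(Type type, const void* data, size_t n) {
      double sum = 0;
      for (size_t i = 0; i < n; ++i) {
        switch (type) {
          case Type::kInt64:  sum += static_cast<const int64_t*>(data)[i]; break;
          case Type::kDouble: sum += static_cast<const double*>(data)[i];  break;
        }
      }
      return sum;
    }

    // Vectorized interpretation: switch once per vector, then run a
    // tight loop that the compiler can unroll or auto-vectorize.
    double SumPerVector(Type type, const void* data, size_t n) {
      switch (type) {
        case Type::kInt64: {
          const int64_t* v = static_cast<const int64_t*>(data);
          double sum = 0;
          for (size_t i = 0; i < n; ++i) sum += v[i];
          return sum;
        }
        case Type::kDouble: {
          const double* v = static_cast<const double*>(data);
          double sum = 0;
          for (size_t i = 0; i < n; ++i) sum += v[i];
          return sum;
        }
      }
      return 0;
    }

Query compilation gets a similar effect from the other direction: since the types are known at plan time, the engine can generate just the inner loop for those types and avoid the switch entirely.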
Impala uses LLVM query compilation to solve that problem in a different way - we can compile specialized code, so we don't need to switch on each value. http://www.vldb.org/pvldb/vol11/p2209-kersten.pdf compares query engines that use vectorization against those that use query compilation and concludes that neither is inherently faster than the other. Recent benchmarking that we've done on TPC-DS supports the idea that Impala's row-based format can be equally performant or faster than pure columnar engines. Other implementation details seem to matter more in practice - different query plans and other optimizations like runtime filters tend to account for bigger differences in performance than the execution engine. Hopefully we can share some results soon.

We could probably get performance benefits in some cases by using a columnar representation more (we already do to some extent in scans of columnar formats like Parquet and Kudu). E.g. it would allow vectorization of some operations like expression evaluation and hashing. I'm honestly not sure whether it would be a significant win - it would depend a lot on the quality of the implementation.

The big advantage of the row-based format (and a reason why we haven't wanted to change it) is that it can be used internally in all the operators, e.g. in the hash tables for hash aggregations and hash joins. This simplifies the implementation a bit and avoids the overhead of rematerializing rows when outputting from a join (see the sketch at the end of this mail).

On Tue, Dec 1, 2020 at 3:24 AM 许益铭 <x1860...@gmail.com> wrote:

> hi, everyone!
> I find that the Impala row batch is row-based (each row is contiguous
> memory), but Spark, Presto and ClickHouse are column-based (each column
> is contiguous memory). Is there any benefit to doing this?
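To make the layout difference from the original question concrete, here is a simplified sketch (hypothetical structs, not Impala's actual RowBatch or hash table classes) of the two layouts, and of why the row layout is convenient for a hash join - the build side can index whole rows by pointer, and the probe side can emit matches without gathering the columns back together:

    #include <cstdint>
    #include <vector>
    #include <unordered_map>

    // Row-based batch: each row's fields are adjacent in memory.
    struct Row {
      int64_t id;
      double amount;
    };
    using RowBatch = std::vector<Row>;  // row i = batch[i], contiguous

    // Columnar batch: each column's values are adjacent in memory.
    struct ColumnBatch {
      std::vector<int64_t> id;     // row i's id     = id[i]
      std::vector<double> amount;  // row i's amount = amount[i]
    };

    // With rows, the hash join build side can store pointers to whole
    // rows (valid while the build batch stays alive)...
    std::unordered_map<int64_t, const Row*> BuildHashTable(const RowBatch& build) {
      std::unordered_map<int64_t, const Row*> ht;
      for (const Row& r : build) ht.emplace(r.id, &r);
      return ht;
    }

    // ...and the probe side can emit matched rows by pointer, with no
    // per-column gather step to rematerialize each output row. A purely
    // columnar engine has to reassemble matched values column by column
    // at this point.
    void Probe(const RowBatch& probe,
               const std::unordered_map<int64_t, const Row*>& ht,
               std::vector<const Row*>* out) {
      for (const Row& r : probe) {
        auto it = ht.find(r.id);
        if (it != ht.end()) out->push_back(it->second);
      }
    }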