One of the main problems that a columnar internal batch format solves
is interpretation overhead - if you switch on type/expression/etc. once
per value, it is quite slow, but if you switch once per vector of
values, the overhead is amortized across the whole batch and becomes
negligible.
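
To make that concrete, here is a toy C++ sketch (hypothetical types and
functions, not anything from Impala's source). The first version pays the
type-dispatch branch once per value; the second pays it once per batch and
leaves a tight typed loop that the compiler can unroll and vectorize:

    #include <cstdint>

    enum class Type { INT64, DOUBLE };

    // Per-value interpretation: the type switch runs once per value.
    int64_t SumSlow(Type t, const void* vals, int n) {
      int64_t sum = 0;
      for (int i = 0; i < n; ++i) {
        switch (t) {
          case Type::INT64:
            sum += static_cast<const int64_t*>(vals)[i];
            break;
          case Type::DOUBLE:
            sum += static_cast<int64_t>(static_cast<const double*>(vals)[i]);
            break;
        }
      }
      return sum;
    }

    // Per-vector interpretation: switch once, then run a tight typed loop.
    int64_t SumFast(Type t, const void* vals, int n) {
      switch (t) {
        case Type::INT64: {
          const int64_t* v = static_cast<const int64_t*>(vals);
          int64_t sum = 0;
          for (int i = 0; i < n; ++i) sum += v[i];
          return sum;
        }
        case Type::DOUBLE: {
          const double* v = static_cast<const double*>(vals);
          double sum = 0;
          for (int i = 0; i < n; ++i) sum += v[i];
          return static_cast<int64_t>(sum);
        }
      }
      return 0;
    }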

Impala uses LLVM query compilation to solve that problem in a different way
- we can compile specialized code, so we don't need to switch on each value.
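
Impala's actual codegen emits LLVM IR at runtime, but as a rough analogy in
plain C++, a template instantiation plays the same role - once the type is
known, the specialized function contains no switch at all:

    // Analogy only: this instantiation stands in for the specialized
    // function that Impala's LLVM codegen generates at runtime once the
    // column type is known - there is no type switch anywhere.
    template <typename T>
    int64_t SumCompiled(const T* vals, int n) {
      T sum = 0;
      for (int i = 0; i < n; ++i) sum += vals[i];
      return static_cast<int64_t>(sum);
    }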

http://www.vldb.org/pvldb/vol11/p2209-kersten.pdf ("Everything You Always
Wanted to Know About Compiled and Vectorized Queries But Were Afraid to
Ask", VLDB 2018) compares query engines that use vectorization with those
that use query compilation and concludes that neither approach is
inherently faster than the other.

Recent benchmarking that we've done on TPC-DS supports the idea that
Impala's row-based format can perform as well as or better than pure
columnar engines. Other implementation details seem to matter more in
practice - differences in query plans and optimizations like runtime
filters tend to account for bigger performance differences than the
execution engine's data layout. Hopefully we can share some results soon...

We could probably get performance benefits in some cases by using a
columnar representation more (we already do to some extent in scans of
columnar formats like Parquet and Kudu). E.g. it would allow vectorization
of some operations like expression evaluation and hashing, as in the sketch
below. I am honestly not sure whether it would be a significant win - it
would depend a lot on the quality of the implementation.
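
As a sketch of what I mean (toy code, not Impala's): with a contiguous
column, an expression like col * 2 + 1 is a branch-free loop over adjacent
memory that auto-vectorizes well; with a row layout, every load strides
over the rest of the row:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Columnar layout: contiguous input and output, easy for the compiler
    // to auto-vectorize with SIMD.
    void EvalColumnar(const int64_t* col, int64_t* out, size_t n) {
      for (size_t i = 0; i < n; ++i) out[i] = col[i] * 2 + 1;
    }

    // Row layout: the same expression, but each load strides over the
    // whole row, so SIMD is much harder to apply.
    void EvalRowBased(const uint8_t* rows, size_t row_size,
                      size_t col_offset, int64_t* out, size_t n) {
      for (size_t i = 0; i < n; ++i) {
        int64_t val;
        std::memcpy(&val, rows + i * row_size + col_offset, sizeof(val));
        out[i] = val * 2 + 1;
      }
    }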

The big advantage of the row-based format (and a reason why we haven't
wanted to change it) is that it can be used internally in all the
operators, e.g. in the hash tables for hash aggregations and hash joins.
This simplifies the implementation a bit and avoids the overhead of
rematerializing rows when outputting from a join.
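
A toy illustration of that point (hypothetical structures, not Impala's
actual hash table): the table can point straight at build-side rows as they
already exist in memory, and the probe emits pairs of existing rows with no
per-column gather/rematerialization step:

    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Hypothetical sketch: one contiguous row in a batch.
    struct Row { const uint8_t* data; };

    using JoinHashTable = std::unordered_map<int64_t, Row>;

    // Probe emits joined rows by referencing existing build/probe rows;
    // a columnar engine would instead gather output values column by
    // column here.
    std::vector<std::pair<Row, Row>> Probe(
        const JoinHashTable& ht,
        const std::vector<std::pair<int64_t, Row>>& probe_side) {
      std::vector<std::pair<Row, Row>> out;
      for (const auto& [key, probe_row] : probe_side) {
        auto it = ht.find(key);
        if (it != ht.end()) out.emplace_back(it->second, probe_row);
      }
      return out;
    }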

On Tue, Dec 1, 2020 at 3:24 AM 许益铭 <x1860...@gmail.com> wrote:

> hi, everyone!
> I see that Impala's row batch is row-based (each row is contiguous in
> memory), but Spark, Presto, and ClickHouse are column-based (each
> column is contiguous in memory). Is there any benefit to doing this?
