yjshen opened a new issue #1708: URL: https://github.com/apache/arrow-datafusion/issues/1708
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** **_Many pipeline-breaking operators are inherently row-based:_** For sort that would shuffle records around, re-order would cause random memory access patterns for each column in the current columnar organization. The performance will deteriorate as the number of columns grows. Besides, the compound sort key also requires us to access different columns. On the other hand, row-based representation avoids this problem (performance deteriorates with payload column number growth). we can check [here](https://dl.acm.org/doi/10.1145/1409360.1409380) for more explanations. For hashtable entries that we buffer aggregation state, we are already utilizing a row-based format indirectly -- We use `Vec<ScalarValue>` as a state for each key. Vector of `ScalarValue` is mostly stored continuously in memory but faced with two kinds of inefficiency: 1. memory overhead introduced by `ScalarValue` enum (16bytes per field according to @alamb ); 2. string or other non-primitive values stored on the heap elsewhere and accessed through pointers. ```text ┌───────────────────────────────────────────────────────┐ │ │ │ ┌────────────────┬────────────────┬────────────────┐ │ │ │ ScalarValue │ ScalarValue │ ScalarValue │ │ │ │ ::Int(5) │ ::Int(10) │ ::Int(3) │ │ │ └────────────────┴────────────────┴────────────────┘ │ │ Hash Table Entry │ │ Vec<ScalarValue> │ └───────────────────────────────────────────────────────┘ When the keys are primitive values, they are stored contiguously in the Vec ┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐ "foo" │(heap allocation)│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ▲ ┌───────┘ ┌───────────────────────────┼───────────────────────────┐ │ │ │ │ ┌────────────────┬────────────────┬────────────────┐ │ │ │ ScalarValue │ ScalarValue │ ScalarValue │ │ │ │ ::Int(5) │ ::Utf8("foo") │ ::Int(3) │ │ │ └────────────────┴────────────────┴────────────────┘ │ │ Hash Table Entry │ │ Vec<ScalarValue> │ └───────────────────────────────────────────────────────┘ When the keys have strings/binary data, the variable length data is stored non contiguously in the Vec ``` _I quote these two great diagrams above from @alamb. Thanks again!_ For join, whether hash-based or sort-based, would suffer from similar problems as above. **Describe the solution you'd like** 1. A `vec<u8>` based representation for tuple, store all columns continuously in memory, for row-logic operations. 2. Efficient coding/decoding method from/to columnar arrow data. 3. Access cells in `vec<u8>` tuple efficiently. We could refer to PostgreSQL / DuckDB / Spark for the row format design. But note Spark's `UnsafeRow` incurs a lot of memory overhead due to its 8-byte alignment. **Describe alternatives you've considered** -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org