yjshen opened a new issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   **_Many pipeline-breaking operators are inherently row-based:_** 
   
   For sort that would shuffle records around, re-order would cause random 
memory access patterns for each column in the current columnar organization. 
The performance will deteriorate as the number of columns grows. Besides, the 
compound sort key also requires us to access different columns.
   On the other hand, row-based representation avoids this problem (performance 
deteriorates with payload column number growth). we can check 
[here](https://dl.acm.org/doi/10.1145/1409360.1409380) for more explanations.
   
   For hashtable entries that we buffer aggregation state, we are already 
utilizing a row-based format indirectly -- We use `Vec<ScalarValue>` as a state 
for each key. Vector of `ScalarValue` is mostly stored continuously in memory 
but faced with two kinds of inefficiency: 1. memory overhead introduced by 
`ScalarValue` enum (16bytes per field according to @alamb ); 2. string or other 
non-primitive values stored on the heap elsewhere and accessed through 
pointers. 
   
   ```text
   ┌───────────────────────────────────────────────────────┐
   │                                                       │
   │ ┌────────────────┬────────────────┬────────────────┐  │
   │ │  ScalarValue   │  ScalarValue   │  ScalarValue   │  │
   │ │    ::Int(5)    │   ::Int(10)    │    ::Int(3)    │  │
   │ └────────────────┴────────────────┴────────────────┘  │
   │   Hash Table Entry                                    │
   │   Vec<ScalarValue>                                    │
   └───────────────────────────────────────────────────────┘
    When the keys are primitive values, they are stored     
                  contiguously in the Vec                   
   
   
                              ┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐           
                                     "foo"                  
                              │(heap allocation)│           
                               ─ ─ ─ ─ ─ ─ ─ ─ ─            
                                       ▲                    
                               ┌───────┘                    
   ┌───────────────────────────┼───────────────────────────┐
   │                           │                           │
   │ ┌────────────────┬────────────────┬────────────────┐  │
   │ │  ScalarValue   │  ScalarValue   │  ScalarValue   │  │
   │ │    ::Int(5)    │ ::Utf8("foo")  │    ::Int(3)    │  │
   │ └────────────────┴────────────────┴────────────────┘  │
   │   Hash Table Entry                                    │
   │   Vec<ScalarValue>                                    │
   └───────────────────────────────────────────────────────┘
     When the keys have strings/binary data, the variable   
      length data is stored non contiguously in the Vec     
   ```
   _I quote these two great diagrams above from @alamb. Thanks again!_
   
   
   For join, whether hash-based or sort-based, would suffer from similar 
problems as above.
   
   **Describe the solution you'd like**
   
   1. A `vec<u8>` based representation for tuple, store all columns 
continuously in memory, for row-logic operations.
   2. Efficient coding/decoding method from/to columnar arrow data.
   3. Access cells in `vec<u8>` tuple efficiently. 
   
   We could refer to PostgreSQL / DuckDB / Spark for the row format design. But 
note Spark's `UnsafeRow` incurs a lot of memory overhead due to its 8-byte 
alignment.
   
   **Describe alternatives you've considered**
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to