alamb opened a new issue #1248:
URL: https://github.com/apache/arrow-datafusion/issues/1248


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Certain record batches have columns in which every value is the same -- for example, the partitioning column in https://github.com/apache/arrow-datafusion/pull/1141
   
   In that case, the `ListingProvider` adds a column whose constant value is derived from the path in the file system.
   
   So a file in a path like `/foo/date=2021-11-5/bar.parquet` would end up 
producing one or more RecordBatches with a `date` column that had the value 
`2021-11-5`. 
   
   At the moment, to represent such a column using an Arrow array we have to repeat the value `2021-11-5` over and over again, which is inefficient both in storage space and in processing (see the sketch after the table below).
   
   | date | other_col |
   | -----| ----------|
   | 2021-11-5 | 1 |
   | 2021-11-5 | 2 |
   | 2021-11-5 | 3 |
   | 2021-11-5 | 4 |
   | ... | ... |
   | 2021-11-5 | 1000000000 |
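   
   For concreteness, here is a minimal sketch (using arrow-rs directly; the row count is just illustrative) of how such a batch is materialized today -- the constant `date` string is physically repeated once per row:
   
   ```rust
   use std::sync::Arc;
   
   use arrow::array::{Int64Array, StringArray};
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::error::ArrowError;
   use arrow::record_batch::RecordBatch;
   
   fn main() -> Result<(), ArrowError> {
       let num_rows = 1_000_000;
   
       // The partition value is stored once per row even though it never varies:
       // ~9 bytes of string data plus a 4-byte offset for every single row.
       let date = StringArray::from(vec!["2021-11-5"; num_rows]);
       let other_col = Int64Array::from((0..num_rows as i64).collect::<Vec<_>>());
   
       let schema = Arc::new(Schema::new(vec![
           Field::new("date", DataType::Utf8, false),
           Field::new("other_col", DataType::Int64, false),
       ]));
       let batch = RecordBatch::try_new(schema, vec![Arc::new(date), Arc::new(other_col)])?;
   
       println!("{} rows, all with the same date", batch.num_rows());
       Ok(())
   }
   ```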
   
   
   **Describe the solution you'd like**
   
   
   It seems that having a sort of `RecordBatch` that can contain constant columns represented by a scalar value (like the C++ `ExecBatch`) is a pretty nice abstraction that would enable a lot of optimizations. I didn't find an issue summarizing this discussion and its conclusion; should we create one?
   
   According to @westonpace, in C++ there is `ExecBatch` in addition to `RecordBatch` for this purpose. `ExecBatch` is used within the exec plan and is pretty much identical to `RecordBatch` except:
   
   * No schema
   * No field names
   * Column values can be scalars
   
   Note that DataFusion already has an Array/Scalar notion in `ColumnarValue` 
https://github.com/apache/arrow-datafusion/blob/a1c794cec233f7fe34f34c7d64f529625a507669/datafusion/src/physical_plan/mod.rs#L388-L395 but its use is confined to expression evaluation at the moment.
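   
   For reference, that enum looks roughly like this (paraphrased from the linked source, doc comments abbreviated):
   
   ```rust
   use arrow::array::ArrayRef;
   use datafusion::scalar::ScalarValue;
   
   /// Roughly the shape of `ColumnarValue` in the linked DataFusion source:
   /// a column is either a whole Arrow array or a single scalar.
   pub enum ColumnarValue {
       /// Array of values
       Array(ArrayRef),
       /// A single value
       Scalar(ScalarValue),
   }
   ```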
   
   Perhaps we could build something like a `DFRecordBatch`, akin to `DFSchema` (i.e. wrap the Arrow `RecordBatch` with methods that allow the columns to be `ColumnarValue`s).
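   
   A very rough sketch of what that wrapper could look like -- all names here (`DFRecordBatch`, `column`, `into_record_batch`) are hypothetical, and a real design would also need to cover field names, schema checks, and conversion back from `RecordBatch`:
   
   ```rust
   use arrow::datatypes::SchemaRef;
   use arrow::record_batch::RecordBatch;
   use datafusion::error::Result;
   use datafusion::physical_plan::ColumnarValue;
   
   /// Hypothetical batch whose columns may be full arrays or constant scalars.
   /// (Sketch only -- this type does not exist in DataFusion today.)
   pub struct DFRecordBatch {
       schema: SchemaRef,
       columns: Vec<ColumnarValue>,
       /// Stored explicitly: an all-scalar batch has no array to infer it from.
       num_rows: usize,
   }
   
   impl DFRecordBatch {
       pub fn schema(&self) -> SchemaRef {
           self.schema.clone()
       }
   
       pub fn num_rows(&self) -> usize {
           self.num_rows
       }
   
       /// A column is either an array or a scalar; operators that understand
       /// scalars can skip the per-row work for constant columns entirely.
       pub fn column(&self, i: usize) -> &ColumnarValue {
           &self.columns[i]
       }
   
       /// Expand into a plain `RecordBatch` for operators that only understand
       /// arrays. Assumes `ScalarValue::to_array_of_size(usize) -> ArrayRef`;
       /// adjust if the DataFusion API differs.
       pub fn into_record_batch(self) -> Result<RecordBatch> {
           let columns = self
               .columns
               .into_iter()
               .map(|c| match c {
                   ColumnarValue::Array(array) => array,
                   ColumnarValue::Scalar(scalar) => scalar.to_array_of_size(self.num_rows),
               })
               .collect::<Vec<_>>();
           Ok(RecordBatch::try_new(self.schema, columns)?)
       }
   }
   ```
   
   The explicit `num_rows` is the main difference from `RecordBatch`: a batch whose columns are all scalars still has a well-defined length, just no arrays to derive it from.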
   
   **Describe alternatives you've considered**
   We could do better with a `Dictionary(Int8, Utf8)` array, where each repeated string value only requires a single `i8` key (1 byte), but that still requires roughly 1 byte * the total number of rows in the `RecordBatch`.
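   
   A quick sketch of that alternative with arrow-rs: the dictionary values buffer holds `"2021-11-5"` exactly once, but the keys buffer still stores one `i8` per row.
   
   ```rust
   use arrow::array::{Array, DictionaryArray};
   use arrow::datatypes::Int8Type;
   
   fn main() {
       let num_rows = 1_000_000;
   
       // Every row maps to dictionary key 0; the string itself is stored once,
       // but there is still one i8 key per row (~1 byte * num_rows).
       let date: DictionaryArray<Int8Type> =
           vec!["2021-11-5"; num_rows].into_iter().collect();
   
       assert_eq!(date.len(), num_rows);
       assert_eq!(date.values().len(), 1); // only one distinct value
   }
   ```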
   
   Another classic approach I have seen for this kind of problem is to use [run-length encoding (RLE)](https://en.wikipedia.org/wiki/Run-length_encoding) so that these “almost entirely constant columns” become a very small number of RLE runs.
   
   I think RLE has been proposed for Arrow several times before, but it doesn’t have the “access `array[i]` in constant time” property required for Arrow arrays.
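   
   For illustration only (this is not an Arrow array type), a run-length representation of the column above might look like the sketch below, which also shows why `array[i]` access stops being constant time:
   
   ```rust
   /// Conceptual sketch of a run-length encoded column: (value, run length)
   /// pairs instead of one stored value per row. Not an Arrow type.
   struct RleColumn<T> {
       runs: Vec<(T, usize)>,
   }
   
   impl<T> RleColumn<T> {
       /// An entirely constant column compresses to a single run.
       fn constant(value: T, num_rows: usize) -> Self {
           Self { runs: vec![(value, num_rows)] }
       }
   
       /// Random access has to walk (or binary search) the runs, so it is not
       /// the O(1) `array[i]` lookup that Arrow arrays guarantee.
       fn value_at(&self, index: usize) -> Option<&T> {
           let mut remaining = index;
           for (value, len) in &self.runs {
               if remaining < *len {
                   return Some(value);
               }
               remaining -= *len;
           }
           None
       }
   }
   
   fn main() {
       // One billion logical rows stored as a single run.
       let date = RleColumn::constant("2021-11-5", 1_000_000_000);
       assert_eq!(date.value_at(999_999_999), Some(&"2021-11-5"));
       assert_eq!(date.value_at(1_000_000_000), None);
   }
   ```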
   
   **Additional context**
   (this is my transcription of some information in an ASF slack thread: 
https://the-asf.slack.com/archives/C01QUFS30TD/p1634810772007000) 
   
   cc @houqp  @rdettai @Dandandan 

