[GitHub] [arrow-rs] tustvold opened a new issue, #2677: Arrow Row Format

GitBox Wed, 07 Sep 2022 07:46:32 -0700


tustvold opened a new issue, #2677:
URL: https://github.com/apache/arrow-rs/issues/2677


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   I think this crate handles operations on individual columns well, either 
by downcasting to a concrete type or by invoking a `dyn` kernel.
   
   Support for multi-column operations is substantially weaker, with patchy 
coverage of common operations such as sorts, groupings, aggregations, and 
reassembly. We have some pieces, such as `MutableArrayData` and 
`DynComparator`, but they are neither especially performant, as they make 
extensive use of dynamic dispatch at the row level, nor easy to use.
   
   **Describe the solution you'd like**
   
   A first-class row representation would not only allow us to implement 
more performant versions of existing kernels such as lexsort, but also give 
downstreams a compelling primitive with which to implement more advanced 
operations such as streaming merges, joins, and aggregates. There is also 
precedent: the C++ Arrow library provides its own row format.
   
   **Goals**
   
   * Each row should be encoded as a single sequence of bytes
   * Comparison of the byte arrays should be sufficient to establish ordering 
of the rows
   * It should be possible to convert a selection of rows back to arrays
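
The three goals above can be illustrated with a minimal sketch. It is not the encoding proposed for arrow-rs (this issue does not specify one); it only assumes a common order-preserving technique: a validity byte followed by the big-endian value with its sign bit flipped. The `Row` alias, the two-column schema, and the function names `encode_row` and `decode_row` are hypothetical and exist only for illustration.

```rust
/// Hypothetical two-column row: (nullable i32, nullable i64).
type Row = (Option<i32>, Option<i64>);

/// Encode one row as a single byte sequence (14 bytes for this schema).
/// Each column is a validity byte (0 = null, 1 = valid) followed by the
/// value in big-endian order with the sign bit flipped, so that unsigned
/// byte-wise comparison matches signed numeric order and nulls sort first.
fn encode_row(row: &Row) -> Vec<u8> {
    let mut out = Vec::with_capacity(14);
    match row.0 {
        None => out.extend_from_slice(&[0u8; 5]),
        Some(v) => {
            out.push(1);
            out.extend_from_slice(&((v as u32) ^ 0x8000_0000).to_be_bytes());
        }
    }
    match row.1 {
        None => out.extend_from_slice(&[0u8; 9]),
        Some(v) => {
            out.push(1);
            out.extend_from_slice(&((v as u64) ^ 0x8000_0000_0000_0000).to_be_bytes());
        }
    }
    out
}

/// Convert an encoded row back to its values by reversing the steps above.
/// Panics if `bytes` is shorter than 14 bytes.
fn decode_row(bytes: &[u8]) -> Row {
    let first = if bytes[0] == 0 {
        None
    } else {
        let mut buf = [0u8; 4];
        buf.copy_from_slice(&bytes[1..5]);
        Some((u32::from_be_bytes(buf) ^ 0x8000_0000) as i32)
    };
    let second = if bytes[5] == 0 {
        None
    } else {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&bytes[6..14]);
        Some((u64::from_be_bytes(buf) ^ 0x8000_0000_0000_0000) as i64)
    };
    (first, second)
}
```

With this sketch, sorting a `Vec<Vec<u8>>` of encoded rows with the default byte-slice ordering and then decoding gives the same order as sorting the original tuples, with no per-row dynamic dispatch on column type. Variable-length types such as strings would need a more careful scheme than the one shown here.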
   
   **Non-Goals**
   
   * Support introspection or mutation of the row values
   * Provide a stable encoding for FFI, IO, etc...
   * Provide an "optimal" encoding; the aim is instead a reasonable 
out-of-the-box baseline for common use-cases
   
   **Describe alternatives you've considered**
   
   We could extend the row format in DataFusion; however, this would limit its 
benefits to DataFusion. I think a row-oriented representation is such a 
fundamental primitive that it makes sense to include it in arrow-rs, so that it 
can be used both in its kernels and by downstreams that don't make use of 
DataFusion.
   
   

