[GitHub] [arrow-datafusion] Dandandan commented on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

GitBox Thu, 03 Feb 2022 00:17:52 -0800


Dandandan commented on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1028712924

   Thanks for the research/overview!

   Taking inspiration from DuckDB / PostgreSQL sounds reasonable to me.

   I am wondering whether, for certain operations such as hash aggregate (at
   least with fixed-size input), the data is better stored in a columnar
   format (mutable arrays, with offsets). That allows faster (vectorized)
   operations, such as batch-updating state values, and a faster (free)
   conversion to a columnar array. A row-based format (like the one we have
   now, or a more "advanced" one) would spend some extra time in:

   * Converting values to the row-wise format
   * Interpreting the row-wise format (accessing cells based on data types)
   * Generating columnar data
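
   To make the trade-off concrete, here is a minimal sketch of
   SUM(value) GROUP BY key with both kinds of state. Everything in it
   (ColumnarSumState, row_wise_sum, the 16-byte row layout, i64-only columns)
   is a hypothetical illustration using only the Rust standard library, not
   DataFusion's actual implementation. The comments in row_wise_sum mark where
   the three extra steps listed above would show up.

```rust
use std::collections::HashMap;

/// Columnar aggregate state: one growable vector per state column,
/// indexed by a dense group id. (Hypothetical type, not a DataFusion API.)
struct ColumnarSumState {
    group_ids: HashMap<i64, usize>,
    keys: Vec<i64>,
    sums: Vec<i64>,
}

impl ColumnarSumState {
    fn new() -> Self {
        ColumnarSumState {
            group_ids: HashMap::new(),
            keys: Vec::new(),
            sums: Vec::new(),
        }
    }

    /// Batch update: resolve a group id per input row, then update the
    /// state column in a tight loop over plain i64 values.
    fn update_batch(&mut self, keys: &[i64], values: &[i64]) {
        assert_eq!(keys.len(), values.len());
        let mut ids = Vec::with_capacity(keys.len());
        for &k in keys {
            let next = self.keys.len();
            let id = *self.group_ids.entry(k).or_insert(next);
            if id == next {
                // First time we see this key: open a new group.
                self.keys.push(k);
                self.sums.push(0);
            }
            ids.push(id);
        }
        for (&id, &v) in ids.iter().zip(values) {
            self.sums[id] += v;
        }
    }

    /// "Free" conversion: the state vectors already are the output columns.
    fn finish(self) -> (Vec<i64>, Vec<i64>) {
        (self.keys, self.sums)
    }
}

/// Decode one little-endian i64 cell from an 8-byte slice.
fn read_i64(bytes: &[u8]) -> i64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    i64::from_le_bytes(buf)
}

/// Row-wise state for comparison. Assumed layout: [key: 8 bytes][sum: 8 bytes].
fn row_wise_sum(keys: &[i64], values: &[i64]) -> (Vec<i64>, Vec<i64>) {
    let mut group_ids: HashMap<i64, usize> = HashMap::new();
    let mut rows: Vec<[u8; 16]> = Vec::new();
    for (&k, &v) in keys.iter().zip(values) {
        let next = rows.len();
        let id = *group_ids.entry(k).or_insert(next);
        if id == next {
            // Extra step 1: convert the values to the row-wise format.
            let mut row = [0u8; 16];
            row[..8].copy_from_slice(&k.to_le_bytes());
            rows.push(row);
        }
        // Extra step 2: interpret the row (decode the cell by its type),
        // update it, and encode it back.
        let row = &mut rows[id];
        let sum = read_i64(&row[8..16]) + v;
        row[8..16].copy_from_slice(&sum.to_le_bytes());
    }
    // Extra step 3: generate columnar data from the rows.
    let out_keys = rows.iter().map(|r| read_i64(&r[..8])).collect();
    let out_sums = rows.iter().map(|r| read_i64(&r[8..16])).collect();
    (out_keys, out_sums)
}
```

   Both variants return the same (keys, sums) columns for the same input; the
   difference is only in how much encoding and decoding happens along the way.
   This sketch says nothing about variable-length values, where the picture
   may well be different.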

   The story is probably very different for sorting; I still need to read the
   DuckDB post in detail.

   > For join, whether hash-based or sort-based, would suffer from similar
   > problems as above

   I don't think that is the case for hash join: there is no need for a row
   representation there, as we can keep the left-side data in columnar format
   in memory and we don't mutate it.
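
   As a rough illustration of that point, the sketch below keeps the left
   (build) side as plain immutable columns and builds only a hash index from
   key to row numbers; the join output is produced by gathering rows by index.
   The names build_index, probe and take are hypothetical helpers written for
   this example with the Rust standard library, not DataFusion functions.

```rust
use std::collections::HashMap;

/// Index the build side: key -> row numbers in the (unchanged) left columns.
fn build_index(left_keys: &[i64]) -> HashMap<i64, Vec<usize>> {
    let mut index: HashMap<i64, Vec<usize>> = HashMap::new();
    for (row, &k) in left_keys.iter().enumerate() {
        index.entry(k).or_default().push(row);
    }
    index
}

/// Probe with the right-side keys; returns matching (left, right) row indices
/// for an inner join.
fn probe(index: &HashMap<i64, Vec<usize>>, right_keys: &[i64]) -> (Vec<usize>, Vec<usize>) {
    let mut left_idx = Vec::new();
    let mut right_idx = Vec::new();
    for (right_row, k) in right_keys.iter().enumerate() {
        if let Some(left_rows) = index.get(k) {
            for &left_row in left_rows {
                left_idx.push(left_row);
                right_idx.push(right_row);
            }
        }
    }
    (left_idx, right_idx)
}

/// Gather the selected rows of one column into a new output column.
fn take<T: Clone>(column: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| column[i].clone()).collect()
}
```

   The left columns are only ever read through indices, so no row encoding or
   decoding step appears anywhere in the join.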

   On Thu, Feb 3, 2022, 08:22 QP Hou ***@***.***> wrote:

   > Thanks @yjshen <https://github.com/yjshen> for the detailed research, it
   > looks like postgres's design might be better assuming we only access row
   > values sequentially the majority of the time. I think this is the case for
   > our current hash aggregate and sort implementation?

