Dandandan commented on issue #1708: URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1028712924
Thanks for the research/overview! Taking inspiration from DuckDB / PostgreSQL sounds reasonable to me.

I am wondering whether, for certain operations (e.g. hash aggregate with fixed-size input), the data would be better stored in a columnar format (mutable arrays, with offsets), which allows faster (vectorized) operations (batch-updating state values) and faster (free) conversion to a columnar array.

Another row-based format (like the one we have now, or a more "advanced" one) would spend some extra time in:

* Converting values to the row-wise format
* Interpreting the row-wise format (accessing cells based on data types)
* Generating columnar data

The story is probably very different for sorting; I still need to read the DuckDB post in detail.

> For join, whether hash-based or sort-based, would suffer from similar problems as above

I don't think this is the case for hash join: there is no need for a row representation, since we can keep the left-side data in columnar format in memory and we don't mutate it.

On Thu, Feb 3, 2022, 08:22 QP Hou ***@***.***> wrote:

> Thanks @yjshen <https://github.com/yjshen> for the detailed research, it
> looks like postgres's design might be better assuming we only access row
> values sequentially the majority of the time. I think this is the case for
> our current hash aggregate and sort implementation?
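
To make the columnar aggregate state idea from the comment above concrete, here is a minimal sketch using only the Rust standard library. It is not DataFusion code: `ColumnarSumState`, `update_batch`, and `finish` are hypothetical names, and the example assumes a single `i64` group key and a `SUM` over an `i64` column. Each group gets one slot, and the accumulator values live in a plain vector indexed by slot, so updating is a tight loop over a column and producing the output needs no row-to-column conversion.

```rust
use std::collections::HashMap;

/// Hypothetical columnar accumulator state for `SUM(value) GROUP BY key`.
/// Every group owns one slot (index) in each state column.
struct ColumnarSumState {
    group_index: HashMap<i64, usize>, // group key -> slot
    keys: Vec<i64>,                   // group keys, in slot order
    sums: Vec<i64>,                   // running sums, in slot order
}

impl ColumnarSumState {
    fn new() -> Self {
        Self {
            group_index: HashMap::new(),
            keys: Vec::new(),
            sums: Vec::new(),
        }
    }

    /// Update the state from one input batch given as two equal-length columns.
    fn update_batch(&mut self, keys: &[i64], values: &[i64]) {
        assert_eq!(keys.len(), values.len());

        // Pass 1: map every input row to a slot, allocating slots for new groups.
        let slots: Vec<usize> = keys
            .iter()
            .map(|&k| {
                let next = self.keys.len();
                let slot = *self.group_index.entry(k).or_insert(next);
                if slot == next {
                    // First time this key is seen: extend every state column.
                    self.keys.push(k);
                    self.sums.push(0);
                }
                slot
            })
            .collect();

        // Pass 2: batch-update the state column; no row format is interpreted here.
        for (&slot, &v) in slots.iter().zip(values) {
            self.sums[slot] += v;
        }
    }

    /// Hand back the state columns as they are; no per-row conversion is needed.
    fn finish(self) -> (Vec<i64>, Vec<i64>) {
        (self.keys, self.sums)
    }
}
```

Calling `update_batch` once per input batch and then `finish` returns the group keys and their sums as two columns in first-seen order. Variable-length keys or values would need the offsets mentioned in the comment, which this sketch leaves out.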
