alamb commented on issue #7647: URL: https://github.com/apache/arrow-datafusion/issues/7647#issuecomment-1879161073
TLDR I recommend we revert this change and reopen this ticket while we reconsider how to handle this case better. # Background This ticket caused a functional regression for us downstream in https://github.com/apache/arrow-datafusion/issues/8738 (a query that used to run started erroring). The cause of the issue is that the `LogicalPlan` and `ExecutionPlan`s schemas no longer match. I started exploring making them match in https://github.com/apache/arrow-datafusion/pull/8766 However while working on that issue, I thought more about this change and I am not sure the change described in this issue is good for multi column groupings at all. # Rationale Dictionary Encoding on single column group keys is bad because there is no repetition in the data and therefore the Dictionary encoding is pure over head. For example, given ```sql SELECT ... GROUP BY a ``` Dictionary Encoding is worse than native encoding: ``` ┌──────┐ ┌──────┐ ┌────┐ │ foo │ │ foo │ │ 0 │ ├──────┤ ├──────┤ ├────┤ │ bar │ │ bar │ │ 1 │ ├──────┤ ├──────┤ ├────┤ │ baz │ ────────▶ │ baz │ │ 2 │ ├──────┤ ├──────┤ ├────┤ │ ff │ │ ff │ │ 3 │ ├──────┤ ├──────┤ ├────┤ │ .. │ │ .. │ │ .. │ ├──────┤ ├──────┤ ├────┤ │ aaz │ │ aaz │ │9999│ └──────┘ └──────┘ └────┘ Group Values values array keys array (distinct values of a) ``` However, the story is different when there is a multi column group key, as in that case, dictionary encoding each column can be a significant performance improvement as they are applied to each column individually and each column may have substantial redundancy. For example, given this query ```sql SELECT ... GROUP BY a,b ``` ``` ┌──────┐ ┌──────┐ ┌──────┐ ┌────┐ ┌──────┐ ┌────┐ │ foo │ │ tag1 │ │ foo │ │ 0 │ │ tag1 │ │ 0 │ ├──────┤ ├──────┤ ├──────┤ ├────┤ ├──────┤ ├────┤ │ foo │ │ tag2 │ │ baz │ │ 0 │ │ tag2 │ │ 1 │ ├──────┤ ├──────┤ └──────┘ ├────┤ ├──────┤ ├────┤ │ foo │ │ tag3 │ ────────▶ │ 0 │ │ tag3 │ │ 2 │ ├──────┤ ├──────┤ ├────┤ ├──────┤ ├────┤ │ foo │ │ tag4 │ │ 0 │ │ tag4 │ │ 3 │ ├──────┤ ├──────┤ ├────┤ └──────┘ ├────┤ │ .. │ │ .. │ │ .. │ │ .. │ ├──────┤ ├──────┤ ├────┤ ├────┤ │ baz │ │ tag4 │ │ 1 │ │ 3 │ └──────┘ └──────┘ └────┘ └────┘ Group Values values array keys array values array keys array (distinct values of (a) (a) (b) (b) a,b) ``` This could especially impact multi-phase grouping where dictionary encoding will save significant time hashing values for low cardinality string columns. In fact we think we may have seen a performance regression when picking up this change downstream as well, which could also be explained by the observation abive Thus I recommend we revert this change via https://github.com/apache/arrow-datafusion/pull/8740 while we reconsider how to handle this case (maybe just for single column group by? Maybe do the dictionary encoding within the `RowEncoder` to avoid generating many redudant strings? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
