adriangb commented on PR #22706: URL: https://github.com/apache/datafusion/pull/22706#issuecomment-4602513773
Before I review the 2k LOC a couple of questions: > The most common offender in real workloads is a single nested column (e.g. LargeList<Struct<Utf8, LargeUtf8>>) that drags an otherwise primitive-heavy 30+ column GROUP BY onto the slow path. Is "most common offender" from your production data? Are there any other unsupported cases (e.g. are all dates, etc. supported)? A table / confirmation of what is and what is not supported would be great. > This PR adds column-native GroupColumn implementations for FixedSizeList<primitive>, Struct<...>, List<T>, and LargeList<T>, so a GROUP BY containing any of these no longer falls back from GroupValuesColumn to GroupValuesRows. Since you're adding support for multiple things my immediate ask would be: can we split this into 1 PR prepping whatever refactoring is needed + 1 PR per type we are adding support for? You can prototype this by refactoring this PR into stacked commits then we can split it once that looks good. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
