adriangb commented on PR #22706:
URL: https://github.com/apache/datafusion/pull/22706#issuecomment-4602513773

   Before I review the 2k LOC a couple of questions:
   
   > The most common offender in real workloads is a single nested column (e.g. 
LargeList<Struct<Utf8, LargeUtf8>>) that drags an otherwise primitive-heavy 30+ 
column GROUP BY onto the slow path.
   
   Is "most common offender" from your production data? Are there any other 
unsupported cases (e.g. are all dates, etc. supported)? A table / confirmation 
of what is and what is not supported would be great.
   
   > This PR adds column-native GroupColumn implementations for 
FixedSizeList<primitive>, Struct<...>, List<T>, and LargeList<T>, so a GROUP BY 
containing any of these no longer falls back from GroupValuesColumn to 
GroupValuesRows.
   
   Since you're adding support for multiple things my immediate ask would be: 
can we split this into 1 PR prepping whatever refactoring is needed + 1 PR per 
type we are adding support for? You can prototype this by refactoring this PR 
into stacked commits then we can split it once that looks good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to