alamb opened a new issue, #10921: URL: https://github.com/apache/datafusion/issues/10921
### Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/10918

In order to take advantage of the parquet reader generating `StringViewArray`s (https://github.com/apache/arrow-rs/issues/5530 from @ariesdevil (❤️)), we need to make sure DataFusion doesn't immediately cast the array back to `StringArray`, which would undo the benefits.

```
┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐   ▲
    StringArray       │   After filtering,
└ ─ ─ ─ ─ ─ ─ ─ ─ ┘   │   any unfiltered rows
        ...           │   are gathered via
                      │   the `take` kernel
┌────────────────────────────┐
│                            │
│         FilterExec         │
│                            │
└────────────────────────────┘
┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐   ▲
    StringArray       │
└ ─ ─ ─ ─ ─ ─ ─ ─ ┘   │   Reading String data
        ...           │   from a Parquet file
┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐   │   results in
    StringArray       │   StringArrays passed
└ ─ ─ ─ ─ ─ ─ ─ ─ ┘   │
┌────────────────────────────┐
│                            │
│        ParquetExec         │
│                            │
└────────────────────────────┘

      Current situation
```

### Describe the solution you'd like

To support a phased rollout of this feature, I recommend we focus at first on only the first filtering operation.

Specifically, get to the point where the parquet reader will read data out as `StringViewArray` like this:

```
┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐   ▲
    StringArray       │
└ ─ ─ ─ ─ ─ ─ ─ ─ ┘   │
        ...           │
                      │
┌────────────────────────────┐
│                            │
│         FilterExec         │
│                            │
└────────────────────────────┘
┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐   ▲
  StringViewArray     │
└ ─ ─ ─ ─ ─ ─ ─ ─ ┘   │
        ...           │
┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐   │
  StringViewArray     │
└ ─ ─ ─ ─ ─ ─ ─ ─ ┘   │
┌────────────────────────────┐
│                            │
│        ParquetExec         │
│                            │
└────────────────────────────┘

Intermediate Situation 1: pass StringViewArray
between ParquetExec and FilterExec
```

### Describe alternatives you've considered

I suggest we:

1. Make a [configuration setting](https://datafusion.apache.org/user-guide/configs.html) like "force StringViewArray" when reading parquet so we can test this. When this setting is enabled, DataFusion should configure the ParquetExec to produce `StringViewArray` regardless of the type stored in the parquet file.
2. Then work on incrementally rolling out support / testing for various filter expressions (especially string functions like substring and https://github.com/apache/datafusion/issues/10919).

### Additional context

_No response_
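
To illustrate why casting back to a contiguous string array right after the scan would undo the benefit, here is a minimal toy model in plain Rust. It is a sketch under stated assumptions, not the arrow-rs implementation: `ToyStringArray`, `ToyStringViewArray`, `take`, and `to_contiguous` are hypothetical names invented for this example, and the view is simplified to a `(start, len)` pair (real `StringViewArray` views are 16 bytes, inline short strings, and can reference several buffers). It only counts how many string bytes each layout copies when rows are gathered after a filter.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for a contiguous string array: every value lives in
/// `data`, and `offsets[i]..offsets[i + 1]` delimits the i-th value.
pub struct ToyStringArray {
    offsets: Vec<usize>,
    data: Vec<u8>,
}

/// Hypothetical stand-in for a view array: each view is a (start, len) pair
/// into a shared buffer, so rows can be selected without touching the bytes.
pub struct ToyStringViewArray {
    views: Vec<(usize, usize)>,
    buffer: Rc<Vec<u8>>,
}

impl ToyStringArray {
    pub fn from_strs(values: &[&str]) -> Self {
        let mut offsets = vec![0];
        let mut data = Vec::new();
        for v in values {
            data.extend_from_slice(v.as_bytes());
            offsets.push(data.len());
        }
        Self { offsets, data }
    }

    pub fn value(&self, i: usize) -> &str {
        std::str::from_utf8(&self.data[self.offsets[i]..self.offsets[i + 1]]).unwrap()
    }

    /// Gather the rows at `indices`. Returns the new array and the number of
    /// string bytes that had to be copied into its fresh data buffer.
    pub fn take(&self, indices: &[usize]) -> (Self, usize) {
        let mut offsets = vec![0];
        let mut data = Vec::new();
        for &i in indices {
            data.extend_from_slice(&self.data[self.offsets[i]..self.offsets[i + 1]]);
            offsets.push(data.len());
        }
        let copied = data.len();
        (Self { offsets, data }, copied)
    }
}

impl ToyStringViewArray {
    pub fn from_strs(values: &[&str]) -> Self {
        let mut views = Vec::new();
        let mut buffer = Vec::new();
        for v in values {
            views.push((buffer.len(), v.len()));
            buffer.extend_from_slice(v.as_bytes());
        }
        Self { views, buffer: Rc::new(buffer) }
    }

    pub fn value(&self, i: usize) -> &str {
        let (start, len) = self.views[i];
        std::str::from_utf8(&self.buffer[start..start + len]).unwrap()
    }

    /// Gather the rows at `indices`. Only the fixed-size views are copied and
    /// the string buffer is shared, so zero string bytes are copied.
    pub fn take(&self, indices: &[usize]) -> (Self, usize) {
        let views: Vec<(usize, usize)> = indices.iter().map(|&i| self.views[i]).collect();
        (Self { views, buffer: Rc::clone(&self.buffer) }, 0)
    }

    /// Convert back to the contiguous layout, roughly what a cast back to a
    /// contiguous string array implies: every string byte is copied again.
    pub fn to_contiguous(&self) -> (ToyStringArray, usize) {
        let values: Vec<&str> = (0..self.views.len()).map(|i| self.value(i)).collect();
        let array = ToyStringArray::from_strs(&values);
        let copied = array.data.len();
        (array, copied)
    }
}
```

In this model, gathering rows 1 and 3 of `["apache", "datafusion", "parquet", "arrow"]` copies 15 string bytes with the contiguous layout and none with the view layout. Calling `to_contiguous` on the filtered view array copies those 15 bytes after all, which is the kind of cost the phased rollout is meant to avoid between ParquetExec and FilterExec.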