tustvold opened a new issue #2079: URL: https://github.com/apache/arrow-datafusion/issues/2079
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** Currently the file scan operators such as `ParquetExec`, `CsvExec`, etc... are created with a `FileScanConfig`, which internally contains a list of `PartitionedFile`. These `PartitionedFile` are provided grouped together in "file groups". For each of these groups, the operators expose a DataFusion partition which will scan these files sequentially. Whilst this works, I'm getting a little bit concerned we are ending up with quite a lot of complexity within each of the individual, file-format specific operators: * The individual files within a file group may have differing schema - #1669 * If using Hive partitioning, need to project the additional rows from the partition key - #1139 * Edge cases where the file isn't even needed - #1999 * Intra-file parallelism - #1990 This in turn comes with some downsides: * Code duplication between operators for different formats with potential for feature and functionality divergence * The operators are getting very large and quite hard to reason about * Execution details are hidden from the physical plan, potentially limiting parallelism, optimisation, introspection, etc... * Catalog details, such as the partitioning scheme, leak into the physical operators themselves **Describe the solution you'd like** It isn't a fully formed proposal, but I wonder if instead of continuing to extend the individual file format operators we might instead compose together simpler physical operators within the query plan. Specifically I wonder if we might make it so that the `ParquetExec`, `CsvExec` operators handle a single file, and the plan stage within `TableProvider::scan` instead constructs a more complex physical plan containing the necessary `ProjectionExec`, `SchemaAdapter` operators as necessary. For what it is worth, IOx uses a similar approach https://github.com/influxdata/influxdb_iox/blob/main/query/src/provider.rs#L282 and it works quite well. **Describe alternatives you've considered** The current approach could remain **Additional context** I'm trying to take a more holistic view on what the parquet interface upstream should look like, which is somewhat related to this https://github.com/apache/arrow-rs/issues/1473 FYI @rdettai @yjshen @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
