tustvold opened a new issue #2079:
URL: https://github.com/apache/arrow-datafusion/issues/2079


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Currently the file scan operators such as `ParquetExec`, `CsvExec`, etc... 
are created with a `FileScanConfig`, which internally contains a list of 
`PartitionedFile`. These `PartitionedFile` are provided grouped together in 
"file groups". For each of these groups, the operators expose a DataFusion 
partition which will scan these files sequentially.
   
   Whilst this works, I'm getting a little bit concerned we are ending up with 
quite a lot of complexity within each of the individual, file-format specific 
operators:
   
   * The individual files within a file group may have differing schema - #1669
   * If using Hive partitioning, need to project the additional rows from the 
partition key - #1139
   * Edge cases where the file isn't even needed - #1999
   * Intra-file parallelism - #1990
   
   This in turn comes with some downsides:
   
   * Code duplication between operators for different formats with potential 
for feature and functionality divergence
   * The operators are getting very large and quite hard to reason about
   * Execution details are hidden from the physical plan, potentially limiting 
parallelism, optimisation, introspection, etc...
   * Catalog details, such as the partitioning scheme, leak into the physical 
operators themselves
   
   **Describe the solution you'd like**
   
   It isn't a fully formed proposal, but I wonder if instead of continuing to 
extend the individual file format operators we might instead compose together 
simpler physical operators within the query plan. Specifically I wonder if we 
might make it so that the `ParquetExec`, `CsvExec` operators handle a single 
file, and the plan stage within `TableProvider::scan` instead constructs a more 
complex physical plan containing the necessary `ProjectionExec`, 
`SchemaAdapter` operators as necessary.
   
   For what it is worth, IOx uses a similar approach 
https://github.com/influxdata/influxdb_iox/blob/main/query/src/provider.rs#L282 
and it works quite well.
   
   **Describe alternatives you've considered**
   
   The current approach could remain
   
   **Additional context**
   
   I'm trying to take a more holistic view on what the parquet interface 
upstream should look like, which is somewhat related to this 
https://github.com/apache/arrow-rs/issues/1473
   
   FYI @rdettai @yjshen @alamb 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to