[I] Allow supplying a table schema to ParquetExec [datafusion]

via GitHub Thu, 15 Aug 2024 02:49:09 -0700


nrc opened a new issue, #12010:
URL: https://github.com/apache/datafusion/issues/12010


   ### Is your feature request related to a problem or challenge?
   
   We have a couple of situations where our schema evolves and we wish to read 
both old and new data and operate on it seamlessly. We are reading from Parquet 
and in theory this should just work because of the internal `SchemaAdapter`. 
However, in practice we can't make it work without doing work that feels 
abstraction-breaking. This has happened when we've changed regular columns 
into partition columns, and more generally when we've reordered or otherwise 
changed schemas in minor ways.
   
   In more detail, we're implementing `TableProvider` and we have a logical 
schema which we return from `TableProvider::schema`; pushed-down projections 
are passed to us based on this schema. In our `scan` method, we create different 
`ParquetExec`s for each version of the file schema using `FileScanConfig`. We 
then union these execs together to make our data source. To do this, their 
schemas must match and we ensure they match our logical schema. However, doing 
so is painful: we have to create a new projection to reorder the fields, and 
compose this with the pushed-down projection. We have to do some manipulation 
of the file schema and partition columns (presumably some of this is 
unavoidable, but it seems unnecessary that we have the logical schema, a file 
schema that we pass in, and a file schema found from the file). This is made more 
difficult by the fact that you can't control where the partition columns appear 
in the logical schema, they always end up at the end.
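   To illustrate the projection bookkeeping described above, here is a minimal sketch in plain Rust. It is not DataFusion code: the helper names `reorder_projection` and `compose` are hypothetical, and schemas are modelled as slices of column names rather than Arrow schemas.

```rust
/// Hypothetical helper: for each column of the logical schema, find its index
/// in one version of the file schema. Returns `None` if a logical column is
/// missing from the file schema.
fn reorder_projection(logical_fields: &[&str], file_fields: &[&str]) -> Option<Vec<usize>> {
    logical_fields
        .iter()
        .map(|name| file_fields.iter().position(|f| f == name))
        .collect()
}

/// Hypothetical helper: compose two projections. `inner` maps logical
/// positions to file positions; `outer` is the pushed-down projection
/// expressed against the logical schema. The result indexes directly into
/// the file schema. Returns `None` if `outer` refers to a position that
/// `inner` does not have.
fn compose(inner: &[usize], outer: &[usize]) -> Option<Vec<usize>> {
    outer.iter().map(|&i| inner.get(i).copied()).collect()
}
```

   With a logical schema `[a, b, c]` and an old file schema `[b, a, c]`, the reorder projection is `[1, 0, 2]`; composing it with a pushed-down projection `[2, 0]` (logical `c`, `a`) gives `[2, 1]` against the file schema. This has to be repeated for every file schema version, which is the work we would like the reader to encapsulate.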
   
   Although `ParquetExec` has an internal schema adapter, this is not very 
useful (in this case) because its 'target' schema is always the file schema 
from the `FileScanConfig` (in `<ParquetExec as ExecutionPlan>::execute`).
   
   ### Describe the solution you'd like
   
   I'm not sure exactly what this should look like. I think I would like to 
supply a table schema which describes the output of the `ParquetExec`, is used 
as the 'target' schema for the `SchemaAdapter`, specifies the location of 
partition columns, and is automagically applied to passed-down projections. In 
other words, the 'reader' is able to fully encapsulate schema changes.
   
   ### Describe alternatives you've considered
   
   A few alternatives which would make this situation easier to handle without 
this 'complete' change:
   * Be able to specify the location of partition columns within a schema, not 
just at the end.
   * Provide functionality to compose, invert, and otherwise manipulate 
projections (this would probably require using a projection newtype around 
`Vec<usize>`, which I think would be a good thing anyway).
   * Move the schema adapter factory from `ParquetAdapter` to `FileScanConfig` 
(I'm not sure this would actually help at all and I appreciate that other 
readers might not be able to apply the adaptation, but it feels like this could 
help somehow by better integrating projection, partition handling, and schema 
adaptation).
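
   As a rough idea of what the projection newtype in the second bullet could look like, here is a self-contained sketch. The `Projection` type and its methods (`then`, `invert`, `apply`) are hypothetical names for illustration, not an existing DataFusion API.

```rust
/// Hypothetical newtype around `Vec<usize>`: output column `i` is taken from
/// input column `self.0[i]`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Projection(Vec<usize>);

impl Projection {
    pub fn new(indices: Vec<usize>) -> Self {
        Self(indices)
    }

    /// Compose: apply `self` first, then `next` to its output.
    /// Returns `None` if `next` refers to a column `self` does not produce.
    pub fn then(&self, next: &Projection) -> Option<Projection> {
        next.0
            .iter()
            .map(|&i| self.0.get(i).copied())
            .collect::<Option<Vec<usize>>>()
            .map(Projection)
    }

    /// Invert a pure reordering. Returns `None` if `self` is not a
    /// permutation of `0..len` (it drops or duplicates columns).
    pub fn invert(&self) -> Option<Projection> {
        let mut inverse: Vec<Option<usize>> = vec![None; self.0.len()];
        for (out_pos, &in_pos) in self.0.iter().enumerate() {
            let slot = inverse.get_mut(in_pos)?;
            if slot.is_some() {
                return None; // duplicate input column
            }
            *slot = Some(out_pos);
        }
        inverse.into_iter().collect::<Option<Vec<usize>>>().map(Projection)
    }

    /// Apply the projection to a slice of items (e.g. field names).
    pub fn apply<T: Clone>(&self, items: &[T]) -> Option<Vec<T>> {
        self.0.iter().map(|&i| items.get(i).cloned()).collect()
    }
}
```

   For example, `Projection::new(vec![2, 0, 1])` applied to `["a", "b", "c"]` yields `["c", "a", "b"]`, and composing it with its inverse gives the identity projection. Returning `Option` keeps out-of-range or non-invertible projections from panicking; a real implementation might prefer a proper error type.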
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
