[I] Support custom SchemaAdapter on ParquetExec [datafusion]

via GitHub Mon, 06 May 2024 12:51:22 -0700


HawaiianSpork opened a new issue, #10398:
URL: https://github.com/apache/datafusion/issues/10398


   ### Is your feature request related to a problem or challenge?
   
   This is a feature request to allow the ParquetExec type to accept a 
SchemaAdapter instead of having a fixed SchemaAdapter.  By supporting a 
SchemaAdapter to be injected, the same ParquetExec could be reused by a number 
of protocols that build upon parquet.  For example, delta-rs keeps the schema 
separate from the parquet so that schema evolution can be well controlled.  For 
instance, the external schema can enrich the data inside the parquet files with 
missing nested columns or timezone information.
   
   This same pattern may also be useful for other storage formats, since the 
mapper just accepts the record batch from the file and a desired table schema.
   
   ### Describe the solution you'd like
   
   ParquetExec accepts a SchemaAdapterFactory, which it calls to create a 
SchemaAdapter per file.  The SchemaAdapter returns a SchemaMapper (just like it 
does today), which is used to transform the RecordBatch into the desired format.
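
A minimal sketch of how the three pieces could fit together is shown here. It 
is only an illustration: `Schema` and `Batch` are hypothetical, simplified 
stand-ins for the Arrow types so that the example compiles with the standard 
library alone, and the trait signatures, `FillMissingFactory`, and `read_file` 
are assumptions rather than the actual DataFusion API.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for Arrow's Schema and RecordBatch.
#[derive(Clone, Debug, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

#[derive(Clone, Debug, PartialEq)]
pub struct Batch {
    pub schema: Schema,
    // One column of nullable integers per field, in schema order.
    pub columns: Vec<Vec<Option<i64>>>,
}

/// Transforms a batch read from one file into the table schema.
pub trait SchemaMapper {
    fn map_batch(&self, batch: Batch) -> Result<Batch, String>;
}

/// Builds a mapper for one file's schema.
pub trait SchemaAdapter {
    fn map_schema(&self, file_schema: &Schema) -> Result<Box<dyn SchemaMapper>, String>;
}

/// Injected into the scan; called once per file.
pub trait SchemaAdapterFactory {
    fn create(&self, table_schema: Arc<Schema>) -> Box<dyn SchemaAdapter>;
}

// Example policy: match columns by name, fill missing columns with nulls.
struct FillMissingMapper {
    table_schema: Arc<Schema>,
    // For each table column, the index of the file column that feeds it.
    source_index: Vec<Option<usize>>,
}

impl SchemaMapper for FillMissingMapper {
    fn map_batch(&self, batch: Batch) -> Result<Batch, String> {
        let rows = batch.columns.first().map_or(0, |c| c.len());
        let columns = self
            .source_index
            .iter()
            .map(|idx| match idx {
                Some(i) => batch
                    .columns
                    .get(*i)
                    .cloned()
                    .ok_or_else(|| format!("file column {i} missing from batch")),
                None => Ok(vec![None; rows]),
            })
            .collect::<Result<Vec<_>, String>>()?;
        Ok(Batch { schema: (*self.table_schema).clone(), columns })
    }
}

struct FillMissingAdapter {
    table_schema: Arc<Schema>,
}

impl SchemaAdapter for FillMissingAdapter {
    fn map_schema(&self, file_schema: &Schema) -> Result<Box<dyn SchemaMapper>, String> {
        let source_index = self
            .table_schema
            .fields
            .iter()
            .map(|name| file_schema.fields.iter().position(|f| f == name))
            .collect();
        Ok(Box::new(FillMissingMapper {
            table_schema: Arc::clone(&self.table_schema),
            source_index,
        }))
    }
}

pub struct FillMissingFactory;

impl SchemaAdapterFactory for FillMissingFactory {
    fn create(&self, table_schema: Arc<Schema>) -> Box<dyn SchemaAdapter> {
        Box::new(FillMissingAdapter { table_schema })
    }
}

/// Stand-in for the per-file step inside the scan: one adapter per file.
pub fn read_file(
    factory: &dyn SchemaAdapterFactory,
    table_schema: Arc<Schema>,
    file_batch: Batch,
) -> Result<Batch, String> {
    let adapter = factory.create(table_schema);
    let mapper = adapter.map_schema(&file_batch.schema)?;
    mapper.map_batch(file_batch)
}
```

With this shape, a protocol such as delta-rs would only need to supply its own 
factory; the parquet-specific reading code stays where it is.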
   
   ### Describe alternatives you've considered
   
   One could argue that `ParquetExec` should be closed to modification and 
should instead either be decorated or replaced by a new `ExecutionPlan`.  
However, there is a lot of parquet-specific code in `ParquetExec` which these 
protocols would have to rebuild.  Alternatively, we could change the interface 
for `ExecutionPlan`, which would be a breaking change.
   
   Another approach is to say that we don't want to support different ways of 
casting arrow batches for different protocols and that all these changes should 
be made in arrow.  I think different applications are going to have different 
constraints about which migrations they choose to support.  For instance, arrow 
today casts one struct to another based on the position of the fields.  This is 
great for short-lived record batches that are just trying to rename fields, but 
it would be problematic for long-lived arrow batches stored as parquet, as the 
code that wrote the record batch may not be the same code that reads it.  So 
there is an opportunity both to improve arrow and to allow how it is used to 
diverge.
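
To make the difference concrete, here is a small sketch contrasting the two 
casting policies. `StructValue`, `cast_by_position`, and `cast_by_name` are 
hypothetical helpers written for this illustration only; they are not arrow 
APIs, and they model a struct as a plain list of named integers.

```rust
// Hypothetical model of a single struct value: ordered (name, value) pairs.
#[derive(Clone, Debug, PartialEq)]
pub struct StructValue {
    pub fields: Vec<(String, i64)>,
}

/// Position-based cast: keep values in their stored order and take the
/// field names from the target. Returns None if the field counts differ.
pub fn cast_by_position(value: &StructValue, target: &[&str]) -> Option<StructValue> {
    if value.fields.len() != target.len() {
        return None;
    }
    let fields = target
        .iter()
        .zip(&value.fields)
        .map(|(name, (_, v))| (name.to_string(), *v))
        .collect();
    Some(StructValue { fields })
}

/// Name-based cast: look each target field up by name in the stored value.
/// Returns None if any target field is absent.
pub fn cast_by_name(value: &StructValue, target: &[&str]) -> Option<StructValue> {
    target
        .iter()
        .map(|name| {
            value
                .fields
                .iter()
                .find(|(n, _)| n.as_str() == *name)
                .map(|(_, v)| (name.to_string(), *v))
        })
        .collect::<Option<Vec<_>>>()
        .map(|fields| StructValue { fields })
}
```

If a file was written with fields `a, b` and a later reader declares them as 
`b, a`, the position-based cast relabels the values without moving them, while 
the name-based cast keeps each value attached to its original name.  Which 
behaviour is right depends on the application, which is the argument for making 
the adapter pluggable.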
   
   ### Additional context
   
   I've got a code change ready and can open a PR soon.
   
   
   We had some conversation about this on Discord here: 
https://discord.com/channels/885562378132000778/1166447479609376850/1236683250244517991


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

