Re: [I] Allow providing Arrow schema when scanning Parquet files [datafusion]

via GitHub Fri, 14 Mar 2025 06:28:44 -0700


HawaiianSpork commented on issue #5950:
URL: https://github.com/apache/datafusion/issues/5950#issuecomment-2724596144


   > > This should be fixed now by 
https://github.com/apache/datafusion/pull/10515. You can now override the 
schema used in the file scanner using the SchemaAdapter.
   > 
   > Doesn't the SchemaAdapter _convert_ the schema that was already read? So 
it doesn't really solve the issue.
   > 
   > Does passing in a schema to 
[FileScanConfig](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileScanConfig.html#structfield.file_schema)
 not work? Or is this request specifically for a Python API?
   
   The file_schema in FileScanConfig can be used to coerce the schema read from 
parquet into the supplied schema using arrow cast. If, however, you need 
functionality beyond cast (for example, to add columns that don't exist in some 
of the parquet files), then SchemaAdapter can be used to convert the data 
returned before it is used by datafusion. This lets you extend the existing 
parquet table provider; otherwise, a new table provider would need to be 
created.
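
   To make the "beyond cast" case concrete, here is a minimal conceptual sketch in plain Rust of what an adapter step does when a file lacks a column that the table schema expects. It deliberately does not use the DataFusion or Arrow APIs: `Batch`, `Column`, and `adapt_batch` are hypothetical names invented for this illustration, and the real SchemaAdapter trait has a different shape.

```rust
use std::collections::HashMap;

/// A nullable integer column; a hypothetical stand-in for an Arrow array.
type Column = Vec<Option<i64>>;

/// Hypothetical stand-in for a record batch read from a single Parquet file.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    num_rows: usize,
    columns: HashMap<String, Column>,
}

/// Map a file batch onto the table schema (illustrative only, not DataFusion code).
/// Columns are emitted in table-schema order; a column that is missing from the
/// file is filled with nulls so every file yields the same output shape.
fn adapt_batch(table_schema: &[&str], file_batch: &Batch) -> Vec<(String, Column)> {
    table_schema
        .iter()
        .map(|name| {
            let column = file_batch
                .columns
                .get(*name)
                .cloned()
                // Missing in this file: synthesize an all-null column.
                .unwrap_or_else(|| vec![None; file_batch.num_rows]);
            (name.to_string(), column)
        })
        .collect()
}
```

   With a table schema of `["id", "score"]` and a file that only contains `id`, `adapt_batch` returns both columns, with `score` made up entirely of nulls. A plain cast has no source column to convert in that situation, which is the gap the adapter is meant to fill.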



