[PR] Implement schema adapter support for FileSource and add integration tests [datafusion]

via GitHub Thu, 22 May 2025 06:31:10 -0700


kosiew opened a new pull request, #16148:
URL: https://github.com/apache/datafusion/pull/16148


   
   
   ## Which issue does this PR close?
   
   This is part of a series of PRs re-implementing #15295 to close #14657 by adding schema-evolution support in DataFusion for:
   - listing-based tables
   - nested structs
   
   ## Rationale for this change
   
   To enable customizable schema evolution during file scans, we [introduce a `SchemaAdapterFactory` hook into all `FileSource` implementations](https://github.com/apache/datafusion/pull/15295#discussion_r2100959986). This allows users to adapt column mappings and perform transformations (e.g., renaming, casting, adding defaults) without forking core scan logic.
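
To illustrate the idea of the hook, here is a minimal, self-contained sketch. It does not use the DataFusion types: the traits below are simplified stand-ins that only borrow the method names `with_schema_adapter_factory` and `schema_adapter_factory` from this PR, a schema is reduced to a list of column names, and `RenameAdapter`, `RenameAdapterFactory`, `ToySource`, and `resolve` are hypothetical names introduced for the example.

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::sync::Arc;

// Simplified stand-ins, NOT the DataFusion definitions.
pub trait SchemaAdapter: Debug {
    /// For each table column, the index of the file column that feeds it.
    fn map_columns(&self, table: &[&str], file: &[&str]) -> Vec<Option<usize>>;
}

pub trait SchemaAdapterFactory: Debug + Send + Sync {
    fn create(&self) -> Box<dyn SchemaAdapter>;
}

pub trait FileSource: Debug {
    /// Returns a copy of this source that uses `factory` during scans.
    fn with_schema_adapter_factory(
        &self,
        factory: Arc<dyn SchemaAdapterFactory>,
    ) -> Arc<dyn FileSource>;

    /// Returns the configured factory, or `None` for default behavior.
    fn schema_adapter_factory(&self) -> Option<Arc<dyn SchemaAdapterFactory>>;
}

/// Hypothetical adapter that resolves renamed columns (file name -> table name).
#[derive(Debug, Clone)]
pub struct RenameAdapter {
    pub renames: HashMap<String, String>,
}

impl SchemaAdapter for RenameAdapter {
    fn map_columns(&self, table: &[&str], file: &[&str]) -> Vec<Option<usize>> {
        table
            .iter()
            .map(|wanted| {
                file.iter().position(|name| {
                    // Apply the rename, if any, before comparing names.
                    let resolved =
                        self.renames.get(*name).map(String::as_str).unwrap_or(*name);
                    resolved == *wanted
                })
            })
            .collect()
    }
}

#[derive(Debug)]
pub struct RenameAdapterFactory {
    pub renames: HashMap<String, String>,
}

impl SchemaAdapterFactory for RenameAdapterFactory {
    fn create(&self) -> Box<dyn SchemaAdapter> {
        Box::new(RenameAdapter { renames: self.renames.clone() })
    }
}

/// Hypothetical source that stores an optional factory.
#[derive(Debug, Clone, Default)]
pub struct ToySource {
    pub schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
}

impl FileSource for ToySource {
    fn with_schema_adapter_factory(
        &self,
        factory: Arc<dyn SchemaAdapterFactory>,
    ) -> Arc<dyn FileSource> {
        Arc::new(Self { schema_adapter_factory: Some(factory) })
    }

    fn schema_adapter_factory(&self) -> Option<Arc<dyn SchemaAdapterFactory>> {
        self.schema_adapter_factory.clone()
    }
}

/// Resolves table columns against file columns, falling back to exact-name
/// matching when the source has no adapter factory.
pub fn resolve(source: &dyn FileSource, table: &[&str], file: &[&str]) -> Vec<Option<usize>> {
    match source.schema_adapter_factory() {
        Some(factory) => factory.create().map_columns(table, file),
        None => table.iter().map(|t| file.iter().position(|f| f == t)).collect(),
    }
}
```

In this sketch, a source without a factory matches columns by exact name, while a source configured with a `RenameAdapterFactory` that maps `user_id` to `id` can resolve a table column `id` against a file column `user_id`. The scan logic in `resolve` stays the same in both cases, which is the point of injecting the adapter through the source.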
   
   ## What changes are included in this PR?
   
   - **Core API additions**  
     - Added `with_schema_adapter_factory` and `schema_adapter_factory` methods to the `FileSource` trait  
     - Introduced the `impl_schema_adapter_methods!()` macro to reduce boilerplate in each `FileSource` implementation  
     - Added `as_file_source` helper to convert concrete sources into `Arc<dyn FileSource>`
   
   - **Datasource crate updates**  
     - Updated CSV, JSON, Avro, Parquet, and Arrow `FileSource` implementations to store and honor an optional `schema_adapter_factory`  
     - Applied the new macro and helper consistently across all `FileSource` implementations
   
   - **Testing**  
     - Added unit tests:
       - `schema_adapter_factory_tests.rs`
       - `test_adapter_updated.rs`
       - `test_source_adapter_tests.rs`  
       These cover factory wiring, column index mapping, schema transformation logic, and source behavior
     - Added integration tests:
       - `schema_adapter_integration_tests.rs`
       - `apply_schema_adapter_tests.rs`  
       These validate adapter behavior in real-world scenarios such as scanning Parquet files
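
As a rough illustration of the boilerplate reduction, the sketch below shows one way a macro like `impl_schema_adapter_methods!()` and a helper like `as_file_source` could be shaped. The macro body, the helper signature, and the requirement that the source be `Clone` with a field named `schema_adapter_factory` are assumptions made for this example, not the definitions in this PR. The traits are minimal stand-ins, and `CsvLikeSource` and `NoopFactory` are hypothetical.

```rust
use std::fmt::Debug;
use std::sync::Arc;

// Minimal stand-ins so the sketch compiles on its own; not the DataFusion traits.
pub trait SchemaAdapterFactory: Debug + Send + Sync {}

pub trait FileSource: Debug {
    fn with_schema_adapter_factory(
        &self,
        factory: Arc<dyn SchemaAdapterFactory>,
    ) -> Arc<dyn FileSource>;

    fn schema_adapter_factory(&self) -> Option<Arc<dyn SchemaAdapterFactory>>;
}

// Assumed shape of the macro: it expands to both trait methods inside an
// `impl FileSource for T` block. It requires `T: Clone` and a field named
// `schema_adapter_factory` on `T`.
macro_rules! impl_schema_adapter_methods {
    () => {
        fn with_schema_adapter_factory(
            &self,
            factory: Arc<dyn SchemaAdapterFactory>,
        ) -> Arc<dyn FileSource> {
            // Copy every other setting and replace only the factory.
            Arc::new(Self {
                schema_adapter_factory: Some(factory),
                ..self.clone()
            })
        }

        fn schema_adapter_factory(&self) -> Option<Arc<dyn SchemaAdapterFactory>> {
            self.schema_adapter_factory.clone()
        }
    };
}

/// Hypothetical source with one format option plus the optional factory.
#[derive(Debug, Clone, Default)]
pub struct CsvLikeSource {
    pub has_header: bool,
    pub schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
}

impl FileSource for CsvLikeSource {
    // One line replaces two hand-written methods.
    impl_schema_adapter_methods!();
}

/// Assumed shape of the helper: erase the concrete source type.
pub fn as_file_source<T: FileSource + 'static>(source: T) -> Arc<dyn FileSource> {
    Arc::new(source)
}

/// Hypothetical factory used only to exercise the wiring.
#[derive(Debug)]
pub struct NoopFactory;

impl SchemaAdapterFactory for NoopFactory {}
```

Because the sketched `with_schema_adapter_factory` takes `&self` and returns a new `Arc`, the original source is left unchanged, so a source that never receives a factory keeps its default behavior.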
   
   ## Are these changes tested?
   
   Yes. This PR includes comprehensive new tests to ensure:
   1. Default behavior is preserved when no schema adapter is used
   2. Factories can be injected and retrieved via the new API
   3. Adapters correctly map schemas and record batches
   4. The system works end-to-end with real file formats like Parquet
   
   ## Are there any user-facing changes?
   
   Yes:
   - Public API additions to the `FileSource` trait
   - New macro `impl_schema_adapter_methods!()` for downstream implementors
   
   These changes are additive and backward-compatible. Developers implementing custom `FileSource` types must either use the macro or provide the new methods to support schema adapters.
   
   


