DanielCarter-stack commented on issue #10377: URL: https://github.com/apache/seatunnel/issues/10377#issuecomment-3795840875
Thanks for reporting this. I can confirm the issue exists when schema evolution occurs in the same directory.

**Root cause**: The connector only infers the schema from the first file in the list (`filePaths.get(0)`). If that file has an older schema, new fields will be missed. This affects PARQUET/ORC/BINARY formats when no explicit `schema` is configured.

**Evidence**:
- `BaseFileSourceConfig.java:114` uses `filePaths.get(0)` for schema inference
- `ParquetReadStrategy.java:287-332` reads the schema from a single file's footer only
- No logic exists to merge schemas across multiple files

**Workarounds**:
1. Explicitly specify `schema` in your config with all fields (including new ones)
2. Rename files so the newest schema file appears first alphabetically
3. Use separate directories for different schema versions

**Could you confirm**:
1. Did you explicitly configure `schema` in your source config?
2. Can you verify the first file (alphabetically) in your HDFS directory has the old schema?

If you'd like to contribute a fix, we could add a schema merge option (similar to Spark's `mergeSchema`), though scanning all file footers would increase initialization time.
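To make the schema merge idea above more concrete, here is a rough sketch of what the merge step could look like. It is not SeaTunnel code: `SchemaMergeSketch` and `mergeSchemas` are hypothetical names, and each per-file schema is modelled as a plain field-name to type-name map rather than the connector's real row type. It assumes fields keep first-seen order and that a type conflict should fail fast; a real option might prefer type widening instead.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the SeaTunnel codebase.
class SchemaMergeSketch {

    /**
     * Merges per-file schemas (field name -> type name) into a single schema.
     * Fields keep the order in which they are first seen. A field that appears
     * with two different types raises an error instead of being silently dropped.
     */
    static Map<String, String> mergeSchemas(List<Map<String, String>> fileSchemas) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> schema : fileSchemas) {
            for (Map.Entry<String, String> field : schema.entrySet()) {
                // putIfAbsent returns the previously stored type, or null if the field is new
                String existing = merged.putIfAbsent(field.getKey(), field.getValue());
                if (existing != null && !existing.equals(field.getValue())) {
                    throw new IllegalStateException(
                            "Conflicting types for field '" + field.getKey() + "': "
                                    + existing + " vs " + field.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Older file: written before the new column was added
        Map<String, String> oldFile = new LinkedHashMap<>();
        oldFile.put("id", "bigint");
        oldFile.put("name", "string");

        // Newer file: same directory, one extra field
        Map<String, String> newFile = new LinkedHashMap<>();
        newFile.put("id", "bigint");
        newFile.put("name", "string");
        newFile.put("email", "string");

        // The merged schema contains the new field even though the old file is listed first
        System.out.println(mergeSchemas(Arrays.asList(oldFile, newFile)));
    }
}
```

In the connector, the input list would come from reading each file's footer, which is where the extra initialization cost mentioned above comes from. That is why this would probably need to be opt-in.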
