DanielCarter-stack commented on issue #10377:
URL: https://github.com/apache/seatunnel/issues/10377#issuecomment-3795840875

   <!-- code-pr-reviewer -->
   Thanks for reporting this. I can confirm the issue exists when schema 
evolution occurs in the same directory.
   
   **Root cause**: The connector infers the schema only from the first file 
in the list (`filePaths.get(0)`). If that file was written with an older schema, 
fields added in later files are missed. This affects PARQUET/ORC/BINARY formats 
when no explicit `schema` is configured.
   
   **Evidence**:
   - `BaseFileSourceConfig.java:114` uses `filePaths.get(0)` for schema 
inference
   - `ParquetReadStrategy.java:287-332` reads schema from a single file's 
footer only
   - No logic exists to merge schemas across multiple files
   
   **Workarounds**:
   1. Explicitly specify `schema` in your config with all fields (including new 
ones)
   2. Rename files so that a file with the newest schema sorts first alphabetically
   3. Use separate directories for different schema versions
   
   **Could you confirm**:
   1. Did you explicitly configure `schema` in your source config?
   2. Can you verify the first file (alphabetically) in your HDFS directory has 
the old schema?
   
   If you'd like to contribute a fix, we could add a schema merge option 
(similar to Spark's `mergeSchema`), though scanning all file footers would 
increase initialization time.
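
To make the merge idea concrete, here is a minimal, self-contained sketch. It is not SeaTunnel code: the class `SchemaMergeSketch`, the method `mergeSchemas`, and the representation of a schema as a field-name-to-type-name map are all hypothetical stand-ins for whatever the connector reads from each file footer. It only illustrates a union-by-field-name merge that fails on conflicting types.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaMergeSketch {

    // Hypothetical helper: unions per-file schemas (field name -> type name),
    // keeping first-seen field order and rejecting conflicting types.
    static Map<String, String> mergeSchemas(List<Map<String, String>> fileSchemas) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> schema : fileSchemas) {
            for (Map.Entry<String, String> field : schema.entrySet()) {
                String existing = merged.putIfAbsent(field.getKey(), field.getValue());
                if (existing != null && !existing.equals(field.getValue())) {
                    throw new IllegalArgumentException(
                            "Conflicting types for field '" + field.getKey()
                                    + "': " + existing + " vs " + field.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Older file: written before the new column existed.
        Map<String, String> oldFile = new LinkedHashMap<>();
        oldFile.put("id", "bigint");
        oldFile.put("name", "string");

        // Newer file in the same directory: one extra column.
        Map<String, String> newFile = new LinkedHashMap<>();
        newFile.put("id", "bigint");
        newFile.put("name", "string");
        newFile.put("email", "string");

        List<Map<String, String>> schemas = Arrays.asList(oldFile, newFile);

        // Inferring from the first file only misses the new column.
        System.out.println("first file only: " + schemas.get(0).keySet());
        // Merging across all files picks it up.
        System.out.println("merged: " + mergeSchemas(schemas).keySet());
    }
}
```

A real implementation would also need to decide how to handle type widening and nested fields, and whether to cap the number of footers scanned to limit the startup cost mentioned above.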


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

