echauchot commented on pull request #15725: URL: https://github.com/apache/flink/pull/15725#issuecomment-851977090
> Same as in the other PR, please rebase onto 1.13 and update the target branch accordingly.
>
> A high-level question to get me up to speed faster:
>
> * In Avro, we have a reader and a writer schema. If the schema evolves, the writer schema of each record updates, and through schema compatibility I still get the equivalent record in the reader schema automatically. So for Avro, I'd usually specify an additional schema to make sure that my application is forward and backward compatible.
> * Parquet now seems to have a similar concept (I haven't checked the details yet). Having a particular reader schema is even more important there, as it allows us to skip reading large chunks of the file when a specific column is not needed, thanks to the columnar layout of the file.
> * Is your change now effectively disabling the reader schema? Or can it just be omitted and assumed to be the writer schema?
> * How would it work when I read two Parquet files with different schemas that can both be mapped to the same reader schema? For example, consider a schema-evolution case where file 1 is written by pipeline v1 and file 2 is written by pipeline v2 with an additional column that is ignored in the consuming Flink application.

My PR actually fixes a bug: `ParquetInputFormat` takes the Parquet schema as user input, but after the split it reads the Parquet schema again here: https://github.com/apache/flink/blob/52dcf439bb0b8d613fff1efecf015052d5b3a10b/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetInputFormat.java#L170. So a user who provides a read schema will be surprised that the actual schema used after the split is different. My PR touches only the reading part; no change is made to the write schema.

Regarding avoiding reading the file to determine the schema: extracting the schema from the file is actually already done in https://github.com/apache/flink/blob/52dcf439bb0b8d613fff1efecf015052d5b3a10b/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetInputFormat.java#L170. Why do we need to read the Parquet schema from the Flink split in `open()`? Can't we use the schema already extracted in the constructor?
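To make the bug shape concrete, here is a minimal, self-contained sketch of the pattern described above. This is not the actual Flink code; the class name `SchemaOverridingReader` and its structure are hypothetical, only the parquet-mr footer-reading calls are real. It shows a read schema being supplied by the user while `open()` re-reads the schema from the Parquet footer and uses that one instead:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

/** Hypothetical reader mimicking the pattern discussed in this PR. */
class SchemaOverridingReader {

    /** Read schema supplied by the user, analogous to the constructor argument. */
    private final MessageType userReadSchema;

    /** Schema that actually ends up being used after open(). */
    private MessageType effectiveSchema;

    SchemaOverridingReader(MessageType userReadSchema) {
        this.userReadSchema = userReadSchema;
    }

    void open(Path file, Configuration conf) throws IOException {
        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
            // The surprising step: the schema is re-read from the file
            // footer, silently replacing the user-provided read schema.
            MessageType footerSchema =
                    reader.getFooter().getFileMetaData().getSchema();
            this.effectiveSchema = footerSchema;

            // What is argued for above instead: keep the user-provided read
            // schema, which also enables column projection and a stable
            // reader schema across files written with different writer
            // schemas:
            // this.effectiveSchema = userReadSchema;
        }
    }

    MessageType effectiveSchema() {
        return effectiveSchema;
    }
}
```

With the user-provided schema kept as the effective read schema, the two-file schema-evolution scenario from the quoted question would behave as expected: each file's own schema only determines how its bytes are decoded, while the reader schema decides which columns the application sees.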
