echauchot commented on pull request #15725:
URL: https://github.com/apache/flink/pull/15725#issuecomment-851977090


   > Same as in the other PR, please rebase onto 1.13 and update the target branch accordingly.
   > 
   > A high-level question to get me up to speed faster:
   > 
   > * In Avro, we have a reader and a writer schema. If the schema evolves, the writer schema of each record updates, and through schema compatibility I still get the equivalent record in the reader schema automatically. So for Avro, I'd usually specify an additional schema to make sure that my application is forward and backward compatible.
   > * Now it seems that Parquet (I haven't checked the details yet) has a similar concept. Having a particular reader schema is even more important there, as it allows us to skip reading large chunks of the file if a specific column is not needed, thanks to the columnar layout of the file.
   > * Is your change now effectively disabling the reader schema? Or can it just be omitted and assumed to be the writer schema?
   > * How would it work when I read 2 Parquet files with different schemas that can both be mapped to the same reader schema? For example, consider a schema evolution case where file 1 is written by pipeline v1 and file 2 is written by pipeline v2 with an additional column that is ignored in the consuming Flink application.
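
   To make the Avro reader/writer schema point above concrete, here is a minimal sketch of plain Avro (outside Flink) that pins only the reader schema; the file names and the `user-v1.avsc` schema are hypothetical:

   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.file.DataFileReader;
   import org.apache.avro.generic.GenericDatumReader;
   import org.apache.avro.generic.GenericRecord;

   import java.io.File;
   import java.io.IOException;

   public class AvroReaderSchemaExample {
       public static void main(String[] args) throws IOException {
           // Reader schema chosen by the consuming application (hypothetical .avsc file).
           Schema readerSchema = new Schema.Parser().parse(new File("user-v1.avsc"));

           // The writer schema travels inside the Avro container file, so only the
           // reader schema is pinned here; DataFileReader picks up the writer schema
           // from the file header, and Avro's schema resolution maps each record
           // into the reader schema.
           GenericDatumReader<GenericRecord> datumReader =
                   new GenericDatumReader<>(null, readerSchema);

           try (DataFileReader<GenericRecord> fileReader =
                        new DataFileReader<>(new File("records.avro"), datumReader)) {
               for (GenericRecord record : fileReader) {
                   // Fields added by a newer writer schema are dropped; fields missing
                   // from the writer schema fall back to the reader schema's defaults.
                   System.out.println(record);
               }
           }
       }
   }
   ```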
   
   My PR actually fixes a bug: ParquetInputFormat takes a Parquet schema as user input, but after a split it reads the Parquet schema from the file again, here: https://github.com/apache/flink/blob/52dcf439bb0b8d613fff1efecf015052d5b3a10b/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetInputFormat.java#L170. So a user who provides a read schema will be surprised that the actual schema used after the split is different.
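
   A minimal sketch of the surprising pattern (not the actual ParquetInputFormat code; the class and field names are hypothetical): the schema handed to the constructor is silently replaced by the footer schema when the split is opened:

   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.format.converter.ParquetMetadataConverter;
   import org.apache.parquet.hadoop.ParquetFileReader;
   import org.apache.parquet.hadoop.metadata.ParquetMetadata;
   import org.apache.parquet.schema.MessageType;

   // Hypothetical sketch, not the real Flink class.
   class SketchParquetInputFormat {
       private final MessageType userSchema; // schema the user asked to read
       private MessageType effectiveSchema;  // schema actually used after open()

       SketchParquetInputFormat(MessageType userSchema) {
           this.userSchema = userSchema;
       }

       void open(Path splitPath, Configuration conf) throws Exception {
           ParquetMetadata footer = ParquetFileReader.readFooter(
                   conf, splitPath, ParquetMetadataConverter.NO_FILTER);
           // The bug in question: the footer schema overrides the user-provided
           // schema, so userSchema is ignored from this point on.
           this.effectiveSchema = footer.getFileMetaData().getSchema();
       }
   }
   ```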
   
   My PR touches only the reading part; no change is made to the write schema.
   
   Regarding avoiding reading the file to determine the schema: extracting the schema from the file is actually already done in https://github.com/apache/flink/blob/52dcf439bb0b8d613fff1efecf015052d5b3a10b/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetInputFormat.java#L170.
   
   
   Why do we need to read the Parquet schema from the Flink split in _open()_? Can't we use the schema already extracted in the constructor?
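
   For illustration, here is a hypothetical sketch of that direction (again, not the real class): resolve the read schema once in the constructor, falling back to the footer only when the user supplies nothing, and reuse it in _open()_:

   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.format.converter.ParquetMetadataConverter;
   import org.apache.parquet.hadoop.ParquetFileReader;
   import org.apache.parquet.schema.MessageType;

   // Hypothetical sketch, not the real Flink class.
   class SketchParquetInputFormat {
       private final MessageType readSchema;

       SketchParquetInputFormat(MessageType userSchema, Path filePath, Configuration conf)
               throws Exception {
           // Honor the user-provided schema; extract the schema from the footer
           // only when none was supplied, and do it once, at construction time.
           this.readSchema = (userSchema != null)
                   ? userSchema
                   : ParquetFileReader
                           .readFooter(conf, filePath, ParquetMetadataConverter.NO_FILTER)
                           .getFileMetaData()
                           .getSchema();
       }

       void open(Path splitPath) {
           // Reuse the already-resolved schema; no second footer read here.
           MessageType schemaForThisSplit = readSchema;
           // ... build the record reader for the split with schemaForThisSplit ...
       }
   }
   ```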
   
   
   
   

