[GitHub] [flink] echauchot commented on pull request #15725: [FLINK-21389] determine parquet schema from file instead of taking it from user

GitBox Wed, 09 Jun 2021 09:37:08 -0700


echauchot commented on pull request #15725:
URL: https://github.com/apache/flink/pull/15725#issuecomment-857855600



   > Hi @echauchot ,
   > 
   > sorry for the delay, I was busy guiding some FLIPs.
   > 
   > I have checked the PR but it's like I pointed out before:
   > 
   > * Your change is effectively removing reader schema. In the current state, 
you'd always receive reader=writer schema, which is not good in many cases.
   > * We can and should make reader schema optional. All new constructors are 
here to stay and ease using it, if you want to read all columns.
   > * Reading the writer schema currently happens in `#open` and it should 
stay like this. Quite often, the client which submits the application doesn't 
have access to the data it points to (think of GDPR compliance). So it's better 
to do it in `#open`, which is only called on task managers.
   > * Since the writer schema is already read in `#open`, you could populate 
the fields `fieldNames` and `fieldTypes` there in case of no explicit reader 
schema.
   
   Thanks for the clarification about reader schema and writer schema. It makes 
sense to allow reading with a schema different from the one that was used when 
writing the file. But there is still one thing I don't get: what is the point of 
providing a schema in the _ParquetInputFormat_ constructor if it is not the one 
that will be used **at the actual reading time**? It is the writer schema (the 
one extracted in _#open_) that will be used. That is why I qualified the ticket 
as a bug against the current Flink master.
   
   I agree IO operations should only happen on task managers, since the job 
manager might not have access to the data (network filtering or other reasons), 
and also for parallelization, I guess.
   
   When no schema is provided, we can of course set the fields with the values 
read in _#open_.
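
   To make that last point concrete, here is a minimal sketch of the fallback, 
not the actual Flink code. `SchemaResolvingFormat`, its constructors, and the 
`writerFields` argument are hypothetical stand-ins: in the real format the 
writer schema would come from the Parquet file footer inside _#open_, and 
`fieldTypes` would be populated the same way as `fieldNames`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an input format with an optional reader schema.
// This is NOT Flink's ParquetInputFormat; it only illustrates the fallback.
class SchemaResolvingFormat {
    // Null when the user did not pass an explicit reader schema.
    private final List<String> explicitReaderFields;
    // Resolved lazily in open(), i.e. on the task manager side.
    private List<String> fieldNames;

    // "Read all columns" constructor: no reader schema given.
    SchemaResolvingFormat() {
        this(null);
    }

    // Constructor with an explicit reader schema (a projection).
    SchemaResolvingFormat(List<String> readerFields) {
        this.explicitReaderFields =
                readerFields == null ? null : new ArrayList<>(readerFields);
    }

    // Simulates #open: writerFields stands for the schema read from the file.
    void open(List<String> writerFields) {
        if (explicitReaderFields == null) {
            // No reader schema: fall back to the writer schema.
            fieldNames = Collections.unmodifiableList(new ArrayList<>(writerFields));
        } else {
            // Explicit reader schema: keep it, do not overwrite it.
            fieldNames = Collections.unmodifiableList(explicitReaderFields);
        }
    }

    List<String> getFieldNames() {
        if (fieldNames == null) {
            throw new IllegalStateException("open() has not been called yet");
        }
        return fieldNames;
    }
}
```

   With this shape, a format built without a schema exposes every column of the 
file after _#open_, while a format built with a reader schema keeps its own 
projection, and no file access is needed before _#open_.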
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
