echauchot commented on pull request #15725: URL: https://github.com/apache/flink/pull/15725#issuecomment-857855600
> Hi @echauchot ,
>
> sorry for the delay, I was busy guiding some FLIPs.
>
> I have checked the PR but it's like I pointed out before:
>
> * Your change is effectively removing reader schema. In the current state, you'd always receive reader=writer schema, which is not good in many cases.
> * We can and should make reader schema optional. All new constructors are here to stay and ease using it, if you want to read all columns.
> * Reading the writer schema currently happens in `#open` and it should stay like this. Quite often, the client which submits the application doesn't have access to the data it points to (think of GDPR compliance). So it's better to do it in `#open`, which is only called on task managers.
> * Since the writer schema is already read in `#open`, you could populate the fields `fieldNames` and `fieldTypes` there in case of no explicit reader schema.

Thanks for the clarification about reader schema and writer schema. It makes sense to allow reading with a schema different from the one that was used when writing the file.

But there is still one thing I don't get: what is the point of providing a schema in the _ParquetInputFormat_ constructor if it is not the one that will be used **at the actual reading time**? It is the writer schema (the one extracted in _#open_) that will be used. That is why I qualified the ticket as a bug against the current Flink master.

I agree that IO operations should only happen on task managers, as the job manager might not have access to the data (network filtering or other), and I guess also for parallelization.

When no schema is provided, we can of course set the fields from the values read in _#open_.
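
To make the last point concrete, here is a minimal sketch of the behavior under discussion. It is not Flink's actual `ParquetInputFormat`: the class `SchemaResolvingFormat`, its constructors, and the `Map`-based writer schema are hypothetical stand-ins, and only the names `fieldNames`, `fieldTypes`, and `open` come from the comments above. Failing on a field that is missing from the writer schema is also an assumption made for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an input format; NOT Flink's actual ParquetInputFormat. */
final class SchemaResolvingFormat {
    // Explicit reader schema (names of the fields to project), or null to read all columns.
    private final List<String> readerFields;
    // Both arrays are populated in open(), i.e. on the task manager, never on the client.
    private String[] fieldNames;
    private String[] fieldTypes;

    /** No explicit reader schema: all columns of the writer schema are read. */
    SchemaResolvingFormat() {
        this.readerFields = null;
    }

    /** Explicit reader schema: only the listed fields are read. */
    SchemaResolvingFormat(List<String> readerFields) {
        this.readerFields = new ArrayList<>(readerFields);
    }

    /**
     * The writerSchema argument (field name to type name, in file order) simulates
     * the schema that a real format would extract from the file at this point.
     */
    void open(Map<String, String> writerSchema) {
        List<String> selected = (readerFields != null)
                ? readerFields
                : new ArrayList<>(writerSchema.keySet());
        fieldNames = new String[selected.size()];
        fieldTypes = new String[selected.size()];
        for (int i = 0; i < selected.size(); i++) {
            String name = selected.get(i);
            String type = writerSchema.get(name);
            if (type == null) {
                // Assumption: a requested field that the file lacks is treated as an error.
                throw new IllegalArgumentException("Field not in writer schema: " + name);
            }
            fieldNames[i] = name;
            fieldTypes[i] = type;
        }
    }

    String[] getFieldNames() {
        return fieldNames.clone();
    }

    String[] getFieldTypes() {
        return fieldTypes.clone();
    }
}
```

With this shape, the schema passed to the constructor does determine what is read (a projection of the writer schema), and the no-argument constructor falls back to the writer schema discovered in `open`.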
