JingGe commented on pull request #17501:
URL: https://github.com/apache/flink/pull/17501#issuecomment-952861699


   > > @echauchot
   > > > Sure, I saw the comments about split and data types etc... But I feel 
unconfortable about draft PRs because they usually cannot be merged as is. In 
the case of your PR, merging it without the split support could not be done. So 
I guess the correct way to proceed is to use this PR as an environment for 
design discussions and add extra commits until the PR is ready for prod @JingGe 
@fapaul WDYT ?
   > > 
   > > 
   > > you are right, that was the idea of the draft PR. Speaking of the 
splitting support specifically, which will make the implementation way more 
complicated, this PR might be merged without it, because we didn't get any 
requirement for it from the business side. If you have any strong requirement 
w.r.t. the splitting, we'd like to know and reconsider it.
   > 
   > I think splitting is mandatory because if you read a big parquet file with 
no split support, then all the content will end up in a single task manager 
which will lead to OOM
   
   I agree with you, I have actually the same concern, especially from the SQL 
perspective. I didn't really understand your concern about OOM, because on the 
upper side we can control it via `StreamFormat.FETCH_IO_SIZ` and on the under 
side, `ParquetFileReader` will be used, we will not read the whole parquet file 
into memory. 
   
   The goal of this PR is to make everyone on the same page w.r.t. the 
implementation design. Once the design is settled down, the splitting support 
as a feature could be easily done in a follow-up PR. That is why I wrote in the 
PR description explicitly at the beginning that "Splitting is not supported in 
this version". I will update it with more background info.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to