echauchot commented on pull request #17501:
URL: https://github.com/apache/flink/pull/17501#issuecomment-960615936


   > > > > > @echauchot
   > > > > > > Sure, I saw the comments about split and data types etc... But I 
feel unconfortable about draft PRs because they usually cannot be merged as is. 
In the case of your PR, merging it without the split support could not be done. 
So I guess the correct way to proceed is to use this PR as an environment for 
design discussions and add extra commits until the PR is ready for prod @JingGe 
@fapaul WDYT ?
   > > > > > 
   > > > > > 
   > > > > > you are right, that was the idea of the draft PR. Speaking of the 
splitting support specifically, which will make the implementation way more 
complicated, this PR might be merged without it, because we didn't get any 
requirement for it from the business side. If you have any strong requirement 
w.r.t. the splitting, we'd like to know and reconsider it.
   > > > > 
   > > > > 
   > > > > I think splitting is mandatory because if you read a big parquet 
file with no split support, then all the content will end up in a single task 
manager which will lead to OOM
   > > > 
   > > > 
   > > > I agree with you, I have actually the same concern, especially from 
the SQL perspective. I didn't really understand your concern about OOM, because 
on the upper side we can control it via `StreamFormat.FETCH_IO_SIZ` and on the 
under side, `ParquetFileReader` will be used, we will not read the whole 
parquet file into memory.
   > > > The goal of this PR is to make everyone on the same page w.r.t. the 
implementation design. Once the design is settled down, the splitting support 
as a feature could be easily done in a follow-up PR. That is why I wrote in the 
PR description explicitly at the beginning that "Splitting is not supported in 
this version". I will update it with more background info.
   > > 
   > > 
   > > Yes I see that there is a countermeasure regarding possible OOM (fetch 
size) but still, for performance reasons, the split is important. Otherwise the 
parallelism is sub-optimal and Flink focuses on performance. I'm not a 
committer on the Flink project so it is not my decision to merge this PR 
without split but I would tend not to merge without split support to avoid that 
a user suffers from this lack of performance which seems to not meet project 
quality standards.
   > > @AHeise WDYT ?
   > 
   > I'm fine with a follow-up ticket/PR on that one to keep things going. 
Having any support for AvroParquet is better than having none. But it should be 
done before 1.15 release anyhow, such that end-users see only the splittable 
version.
   > 
   > We plan to support splitting for all formats with sync marks but in 
general the role of splitting has shifted, since the whole big data processing 
moved from block-based storages to cloud storages. Earlier, splitting was also 
needed to support data locality, which doesn't apply anymore. So now it's only 
needed to speed up ingestion (you can always rebalance after source), so it is 
necessary only for the most basic pipelines.
   > 
   > TL;DR while splitting is still a should-have feature, the days of 
must-have are gone.
   
   Having the split before any user could see the non-split version was all I 
was interested in. So delivering it before 1.15 release looks perfect ! 
   Thanks a lot Arvid for the info I did not have about the relative importance 
of splitting, I had only the rebalance in mind !
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to