AHeise commented on pull request #17501: URL: https://github.com/apache/flink/pull/17501#issuecomment-957254354
> > > > @echauchot > > > > > Sure, I saw the comments about split and data types etc... But I feel unconfortable about draft PRs because they usually cannot be merged as is. In the case of your PR, merging it without the split support could not be done. So I guess the correct way to proceed is to use this PR as an environment for design discussions and add extra commits until the PR is ready for prod @JingGe @fapaul WDYT ? > > > > > > > > > > > > you are right, that was the idea of the draft PR. Speaking of the splitting support specifically, which will make the implementation way more complicated, this PR might be merged without it, because we didn't get any requirement for it from the business side. If you have any strong requirement w.r.t. the splitting, we'd like to know and reconsider it. > > > > > > > > > I think splitting is mandatory because if you read a big parquet file with no split support, then all the content will end up in a single task manager which will lead to OOM > > > > > > I agree with you, I have actually the same concern, especially from the SQL perspective. I didn't really understand your concern about OOM, because on the upper side we can control it via `StreamFormat.FETCH_IO_SIZ` and on the under side, `ParquetFileReader` will be used, we will not read the whole parquet file into memory. > > The goal of this PR is to make everyone on the same page w.r.t. the implementation design. Once the design is settled down, the splitting support as a feature could be easily done in a follow-up PR. That is why I wrote in the PR description explicitly at the beginning that "Splitting is not supported in this version". I will update it with more background info. > > Yes I see that there is a countermeasure regarding possible OOM (fetch size) but still, for performance reasons, the split is important. Otherwise the parallelism is sub-optimal and Flink focuses on performance. I'm not a committer on the Flink project so it is not my decision to merge this PR without split but I would tend not to merge without split support to avoid that a user suffers from this lack of performance which seems to not meet project quality standards. > > @AHeise WDYT ? I'm fine with a follow-up ticket/PR on that one to keep things going. Having any support for AvroParquet is better than having none. But it should be done before 1.15 release anyhow, such that end-users see only the splittable version. We plan to support splitting for all formats with sync marks but in general the role of splitting has shifted, since the whole big data processing moved from block-based storages to cloud storages. Earlier, splitting was also needed to support data locality, which doesn't apply anymore. So now it's only needed to speed up ingestion (you can always rebalance after source), so it is necessary only for the most basic pipelines. TL;DR while splitting is still a should-have feature, the days of must-have are gone. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
