[GitHub] [flink] AHeise commented on pull request #17501: [Draft][FLINK-21406][RecordFormat] build AvroParquetRecordFormat for the new FileSource

GitBox Tue, 02 Nov 2021 02:21:33 -0700


AHeise commented on pull request #17501:
URL: https://github.com/apache/flink/pull/17501#issuecomment-957254354



   > > > > @echauchot
   > > > > > Sure, I saw the comments about split and data types etc... But I 
feel unconfortable about draft PRs because they usually cannot be merged as is. 
In the case of your PR, merging it without the split support could not be done. 
So I guess the correct way to proceed is to use this PR as an environment for 
design discussions and add extra commits until the PR is ready for prod @JingGe 
@fapaul WDYT ?
   > > > > 
   > > > > 
   > > > > you are right, that was the idea of the draft PR. Speaking of the 
splitting support specifically, which will make the implementation way more 
complicated, this PR might be merged without it, because we didn't get any 
requirement for it from the business side. If you have any strong requirement 
w.r.t. the splitting, we'd like to know and reconsider it.
   > > > 
   > > > 
   > > > I think splitting is mandatory because if you read a big parquet file 
with no split support, then all the content will end up in a single task 
manager which will lead to OOM
   > > 
   > > 
   > > I agree with you, I have actually the same concern, especially from the 
SQL perspective. I didn't really understand your concern about OOM, because on 
the upper side we can control it via `StreamFormat.FETCH_IO_SIZ` and on the 
under side, `ParquetFileReader` will be used, we will not read the whole 
parquet file into memory.
   > > The goal of this PR is to make everyone on the same page w.r.t. the 
implementation design. Once the design is settled down, the splitting support 
as a feature could be easily done in a follow-up PR. That is why I wrote in the 
PR description explicitly at the beginning that "Splitting is not supported in 
this version". I will update it with more background info.
   > 
   > Yes I see that there is a countermeasure regarding possible OOM (fetch 
size) but still, for performance reasons, the split is important. Otherwise the 
parallelism is sub-optimal and Flink focuses on performance. I'm not a 
committer on the Flink project so it is not my decision to merge this PR 
without split but I would tend not to merge without split support to avoid that 
a user suffers from this lack of performance which seems to not meet project 
quality standards.
   > 
   > @AHeise WDYT ?
   
   I'm fine with a follow-up ticket/PR on that one to keep things going. Having 
any support for AvroParquet is better than having none. But it should be done 
before 1.15 release anyhow, such that end-users see only the splittable version.
   
   We plan to support splitting for all formats with sync marks but in general 
the role of splitting has shifted, since the whole big data processing moved 
from block-based storages to cloud storages. Earlier, splitting was also needed 
to support data locality, which doesn't apply anymore. So now it's only needed 
to speed up ingestion (you can always rebalance after source), so it is 
necessary only for the most basic pipelines.
   
   TL;DR while splitting is still a should-have feature, the days of must-have 
are gone.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink] AHeise commented on pull request #17501: [Draft][FLINK-21406][RecordFormat] build AvroParquetRecordFormat for the new FileSource

Reply via email to