Re: [blog article] Howto create a batch source with the new Source framework

Etienne Chauchot Fri, 31 Mar 2023 06:08:45 -0700

Hi Yuxia,

Thanks for your feedback.


Comments inline


On 31/03/2023 at 04:21, yuxia wrote:

Hi, Etienne.

Thanks, Etienne, for sharing this article. I really like it and learned a lot 
from it.

=> Glad it was useful, that was precisely the point :)


I'd like to raise some questions about implementing a batch source. Other devs 
are welcome to share their insights on them.

The first question is how to generate splits:
As the article mentioned:
"Whenever possible, it is preferable to generate the splits lazily, meaning that 
each time a reader asks the enumerator for a split, the enumerator generates one on 
demand and assigns it to the reader."
I think this may not hold in all cases. In some cases, generating a split may be 
time-consuming, so it may be better to generate a batch of splits on demand to 
amortize the cost.
But that raises another question: how many splits should be generated per 
batch? Too many may well cause an OOM; too few may not make good use of batch 
generation.
To solve this, I think we could provide a configuration option that lets the 
user set how many splits are generated per batch.
What's your opinion on it? Have you ever encountered this problem in your 
implementation?

=> I agree, lazy generation is not the only way. I've mentioned in the article
that batch generation is another option when split generation is costly; thanks
for the suggestion. During the implementation I didn't have this problem, as
generating a split was not costly; the only costly processing was the
preparation of the splits. It ran asynchronously and only once, after which each
split generation was straightforward. That being said, during development I did
face OOM risks related to the size and the number of splits. For the number of
splits, lazy generation solved it, as no list of splits was stored in the
enumerator apart from the splits to reassign. For the size of a split, I used a
user-provided max split memory size, similar to what you suggest here. In the
batch generation case, we could allow the user to set a max memory size for the
batch: a number of splits per batch looks more dangerous to me if we don't know
the size of a split, but if we are talking about storing the split objects and
not their content, then that is OK. IMO, a memory size is clearer for the user,
as it relates to the memory of a task manager.
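
To make the memory-budget idea concrete, here is a minimal sketch in plain Java
with no Flink dependency. Everything in it is hypothetical and for illustration
only: the names `RangeSplit` and `BoundedBatchSplitGenerator`, the numeric range
being split, and the assumption that a split's size can be estimated as a fixed
number of bytes per unit of range. It is not how the Cassandra connector does it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical split descriptor: a half-open range [start, end) plus a size estimate.
final class RangeSplit {
    final long start;
    final long end;
    final long estimatedBytes;

    RangeSplit(long start, long end, long estimatedBytes) {
        this.start = start;
        this.end = end;
        this.estimatedBytes = estimatedBytes;
    }
}

// Hypothetical generator: returns splits in batches capped by a memory budget
// instead of a fixed number of splits.
final class BoundedBatchSplitGenerator {
    private final long rangeEnd;
    private final long splitWidth;
    private final long bytesPerUnit;   // assumed size estimate per unit of range
    private final long maxBatchBytes;  // user-provided budget for one batch
    private long next;

    BoundedBatchSplitGenerator(long rangeStart, long rangeEnd, long splitWidth,
                               long bytesPerUnit, long maxBatchBytes) {
        if (splitWidth <= 0) {
            throw new IllegalArgumentException("splitWidth must be positive");
        }
        this.next = rangeStart;
        this.rangeEnd = rangeEnd;
        this.splitWidth = splitWidth;
        this.bytesPerUnit = bytesPerUnit;
        this.maxBatchBytes = maxBatchBytes;
    }

    // Generates the next batch; an empty list means the range is exhausted.
    List<RangeSplit> nextBatch() {
        List<RangeSplit> batch = new ArrayList<>();
        long batchBytes = 0;
        while (next < rangeEnd) {
            long end = Math.min(next + splitWidth, rangeEnd);
            long size = (end - next) * bytesPerUnit;
            // Stop once the budget would be exceeded, but always emit at least
            // one split so that the generator keeps making progress.
            if (!batch.isEmpty() && batchBytes + size > maxBatchBytes) {
                break;
            }
            batch.add(new RangeSplit(next, end, size));
            batchBytes += size;
            next = end;
        }
        return batch;
    }
}
```

With a range of [0, 100), a split width of 10, 1 byte per unit and a 35-byte
budget, each call returns at most three splits, so the number of splits per
batch follows from the budget rather than being configured directly.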



The second question is how to assign splits:
What's your split assignment strategy?

=> The naïve one: a reader asks for a split, the enumerator receives the
request, generates a split and assigns it to the requesting reader.
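
As an illustration of that request/generate/assign loop, here is a minimal
sketch in plain Java with no Flink dependency. The class `LazyEnumerator` and
the numeric range are hypothetical; the method names only loosely echo the
`handleSplitRequest` and `addSplitsBack` callbacks of Flink's `SplitEnumerator`,
and their signatures are simplified here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Hypothetical enumerator model: splits are half-open ranges {start, end}
// created only when a reader asks for one.
final class LazyEnumerator {
    private final long rangeEnd;
    private final long splitWidth;
    private long next;
    // The only splits kept in memory: those handed back after a reader failure.
    private final Deque<long[]> toReassign = new ArrayDeque<>();

    LazyEnumerator(long rangeStart, long rangeEnd, long splitWidth) {
        if (splitWidth <= 0) {
            throw new IllegalArgumentException("splitWidth must be positive");
        }
        this.next = rangeStart;
        this.rangeEnd = rangeEnd;
        this.splitWidth = splitWidth;
    }

    // Called when a reader requests work. readerId is unused in this naive
    // strategy: any reader gets the next available split.
    Optional<long[]> handleSplitRequest(int readerId) {
        if (!toReassign.isEmpty()) {
            return Optional.of(toReassign.poll()); // returned splits go first
        }
        if (next >= rangeEnd) {
            return Optional.empty(); // nothing left: the reader can finish
        }
        long end = Math.min(next + splitWidth, rangeEnd);
        long[] split = {next, end};
        next = end;
        return Optional.of(split);
    }

    // Called when a split must be reassigned, e.g. after a reader failure.
    void addSplitsBack(long[] split) {
        toReassign.add(split);
    }
}
```

Because no list of future splits is ever materialized, the enumerator's memory
footprint stays bounded by the number of splits waiting to be reassigned.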

In Flink, we provide `LocalityAwareSplitAssigner`, which uses locality to 
assign splits to readers.

=> Well, it is only of interest when the data backend cluster nodes can be
co-located with the Flink task managers, right? That would rarely be the case,
as the clusters seem to be kept separate most of the time to use the maximum
available CPU (at least for CPU-bound workloads), no?

But it may not be perfect in the case of failover,

=> Agreed: it would require a costly shuffle to keep the co-location after
restoration, and I think this cost would not be offset by the gain from
co-location (mainly avoiding network use).

for which we intend to introduce another split assignment strategy [1].
But I do think it should be configurable, so that advanced users can decide 
which assignment strategy to use.

=> When you say "user", I guess you mean the user of the source, not the user
of the dev framework (the implementor of the source). I think it should indeed
be configurable, as the user is the one who knows how the partitions of the
backend data are distributed.
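
One way such a user-facing option could look, sketched in plain Java with no
Flink dependency. All names here (`HostedSplit`, `AssignStrategy`,
`FirstAvailable`, `PreferLocal`, `AssignStrategies`) and the option values
"first-available" and "prefer-local" are hypothetical; this is neither Flink's
`LocalityAwareSplitAssigner` nor the strategy proposed in FLINK-31065.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical split descriptor: an id plus the host that stores its data.
final class HostedSplit {
    final String id;
    final String host;

    HostedSplit(String id, String host) {
        this.id = id;
        this.host = host;
    }
}

// Hypothetical strategy: picks (and removes) the next split for a reader.
interface AssignStrategy {
    Optional<HostedSplit> pick(List<HostedSplit> pending, String readerHost);
}

// Naive strategy: hand out the first pending split, ignoring locality.
final class FirstAvailable implements AssignStrategy {
    @Override
    public Optional<HostedSplit> pick(List<HostedSplit> pending, String readerHost) {
        return pending.isEmpty() ? Optional.empty() : Optional.of(pending.remove(0));
    }
}

// Locality-preferring strategy: prefer a split stored on the reader's host,
// and fall back to the first pending split when there is none.
final class PreferLocal implements AssignStrategy {
    @Override
    public Optional<HostedSplit> pick(List<HostedSplit> pending, String readerHost) {
        for (int i = 0; i < pending.size(); i++) {
            if (pending.get(i).host.equals(readerHost)) {
                return Optional.of(pending.remove(i));
            }
        }
        return pending.isEmpty() ? Optional.empty() : Optional.of(pending.remove(0));
    }
}

// Maps a user-provided option value to a strategy.
final class AssignStrategies {
    static AssignStrategy fromConfig(String name) {
        switch (name) {
            case "first-available":
                return new FirstAvailable();
            case "prefer-local":
                return new PreferLocal();
            default:
                throw new IllegalArgumentException("Unknown assign strategy: " + name);
        }
    }
}
```

The source implementor ships the strategies; the user of the source, who knows
how the backend data is laid out, only picks one by name.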


Best

Etienne



Other devs are welcome to share their opinions.

[1]: https://issues.apache.org/jira/browse/FLINK-31065





Also as for the split assigner.


Best regards,
Yuxia

----- Original Message -----
From: "Etienne Chauchot" <echauc...@apache.org>
To: "dev" <dev@flink.apache.org>
Cc: "Chesnay Schepler" <ches...@apache.org>
Sent: Thursday, March 30, 2023, 10:36:39 PM
Subject: [blog article] Howto create a batch source with the new Source framework

Hi all,

After creating the Cassandra source connector (thanks Chesnay for the
review!), I wrote a blog article about how to create a batch source with
the new Source framework [1]. It gives field feedback on how to
implement the different components.

I felt it could be useful to people interested in contributing or
migrating connectors.

=> Can you give me your opinion?

=> I think it could also be useful to post the article on the official Flink
blog, if you agree.

=> Same remark about my previous article [2]: what about publishing it on the
official Flink blog?


[1]: https://echauchot.blogspot.com/2023/03/flink-howto-create-batch-source-with.html

[2]: https://echauchot.blogspot.com/2022/11/flink-howto-migrate-real-life-batch.html


Best

Etienne
