Hi Yuxia,

Thanks for your feedback.

Comments inline


On 31/03/2023 at 04:21, yuxia wrote:
Hi, Etienne.

Thanks to Etienne for sharing this article. I really like it and learned a lot 
from it.
=> Glad it was useful, that was precisely the point :)

I'd like to raise some questions about implementing a batch source. Other devs 
are welcome to share their insights on them.

The first question is how to generate splits:
As the article mentioned:
"Whenever possible, it is preferable to generate the splits lazily, meaning that 
each time a reader asks the enumerator for a split, the enumerator generates one on 
demand and assigns it to the reader."
I think this may not hold in all cases. In some cases, generating a split may be 
time consuming, so it may be better to generate a batch of splits on demand to 
amortize the expense.
But that raises another question: how many splits should be generated in a 
batch? Too many may well cause OOM, too few may not make good use of batch 
split generation.
To solve this, I think we could provide a configuration option that lets the 
user set how many splits should be generated in a batch.
What's your opinion on it? Have you ever encountered this problem in your 
implementation?

=> I agree, lazy split generation is not the only way. I've mentioned in the article that batch generation is another option when split generation is costly, thanks for the suggestion. During the implementation I didn't have this problem, as generating a split was not costly; the only costly processing was the preparation of the splits. It ran asynchronously and only once, after which each split generation was straightforward. That being said, during development I did face OOM risks related to both the size and the number of splits. For the number of splits, lazy generation solved it, as no list of splits was stored in the enumerator apart from the splits to reassign. For the size of a split, I used a user-provided max split memory size, similar to what you suggest here. In the batch generation case, we could allow the user to set a max memory size for the batch: a number of splits per batch looks more dangerous to me if we don't know the size of a split, but if we are talking about storing the split objects and not their content, then that is OK. IMO, a memory size is clearer for the user, as it relates directly to the memory of a task manager.
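
To make the "max memory size for the batch" idea concrete, here is a minimal sketch in plain Java. It deliberately uses no Flink types: `RangeSplit`, `BudgetedSplitBatcher`, `estimatedBytes` and `maxBatchBytes` are hypothetical names chosen for illustration, and the per-split size estimate is assumed to be available from the backend. It is one possible shape, not how the Cassandra connector does it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical split descriptor: a range plus an estimated size in bytes. */
final class RangeSplit {
    final long start;
    final long end;
    final long estimatedBytes;

    RangeSplit(long start, long end, long estimatedBytes) {
        this.start = start;
        this.end = end;
        this.estimatedBytes = estimatedBytes;
    }
}

/** Hands out splits in batches whose estimated size stays under a user-provided budget. */
final class BudgetedSplitBatcher {
    private final Iterator<RangeSplit> source; // lazily produces splits, one at a time
    private final long maxBatchBytes;          // user-provided budget (assumed config value)
    private RangeSplit pending;                // split that did not fit in the previous batch

    BudgetedSplitBatcher(Iterator<RangeSplit> source, long maxBatchBytes) {
        if (maxBatchBytes <= 0) {
            throw new IllegalArgumentException("maxBatchBytes must be positive");
        }
        this.source = source;
        this.maxBatchBytes = maxBatchBytes;
    }

    /** Returns the next batch; an empty list means there are no more splits. */
    List<RangeSplit> nextBatch() {
        List<RangeSplit> batch = new ArrayList<>();
        long used = 0;
        while (pending != null || source.hasNext()) {
            RangeSplit next = (pending != null) ? pending : source.next();
            pending = null;
            // Always accept the first split so that an oversized split cannot stall progress.
            if (!batch.isEmpty() && used + next.estimatedBytes > maxBatchBytes) {
                pending = next; // keep it for the following batch
                break;
            }
            batch.add(next);
            used += next.estimatedBytes;
        }
        return batch;
    }
}
```

With a budget expressed in bytes, the number of splits per batch adapts to their size, which is the property that a fixed "splits per batch" setting lacks.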



The second question is how to assign splits:
What's your split assignment strategy?
=> the naïve one: a reader asks for a split, the enumerator receives the request, generates a split and assigns it to the demanding reader.
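
A minimal sketch of that naïve strategy, in plain Java without Flink types so it stays self-contained. In a real source the request would arrive through `SplitEnumerator#handleSplitRequest` and returned splits through `SplitEnumerator#addSplitsBack`; here `LazyRangeEnumerator`, `totalRange` and `splitLength` are hypothetical names, and a split is modelled as a simple `{start, end}` pair.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

/** Generates one split per request; only splits to reassign are ever stored. */
final class LazyRangeEnumerator {
    private final long totalRange;   // end of the range to cover (exclusive)
    private final long splitLength;  // length of each generated split
    private long nextStart = 0;      // where the next generated split begins
    private final Deque<long[]> toReassign = new ArrayDeque<>();

    LazyRangeEnumerator(long totalRange, long splitLength) {
        if (splitLength <= 0) {
            throw new IllegalArgumentException("splitLength must be positive");
        }
        this.totalRange = totalRange;
        this.splitLength = splitLength;
    }

    /** Called when a reader asks for work; an empty result means "no more splits". */
    Optional<long[]> handleSplitRequest() {
        if (!toReassign.isEmpty()) {
            return Optional.of(toReassign.poll()); // splits from failed readers go first
        }
        if (nextStart >= totalRange) {
            return Optional.empty();
        }
        long end = Math.min(nextStart + splitLength, totalRange);
        long[] split = {nextStart, end};
        nextStart = end;
        return Optional.of(split);
    }

    /** Called when a reader fails before finishing a split. */
    void addSplitBack(long[] split) {
        toReassign.add(split);
    }
}
```

The only state kept between requests is a cursor and the queue of splits to reassign, which is why the number of splits does not create memory pressure in the enumerator.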
In Flink, we provide `LocalityAwareSplitAssigner` to make use of locality when 
assigning splits to readers.
=> well, it is only of interest when the data-backend cluster nodes can be co-located with the Flink task managers, right? That would rarely be the case, as the clusters seem to be kept separate most of the time so that each can use the maximum available CPU (at least for CPU-bound workloads), no?
But it may not be perfect for the failover case
=> Agree: it would require a costly shuffle to keep the co-location after restoration, and I think this cost would not be offset by the gain from co-location (mainly avoiding network use).
for which we intend to introduce another split assignment strategy [1].
But I do think it should be configurable, so that advanced users can decide 
which assignment strategy to use.

=> when you say the "user" I guess you mean the user of the source, not the user of the dev framework (the implementor of the source). I think it should indeed be configurable, as the user is the one who knows how the backend data is distributed across partitions.
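
As a rough illustration of what "configurable" could look like, here is a sketch in plain Java. Everything in it is hypothetical (`HostedSplit`, `AssignStrategy`, `AssignStrategies`, the option values `fifo` and `locality_first`); it is not the API of `LocalityAwareSplitAssigner` nor of the strategy proposed in FLINK-31065, only a way to show the user of the source selecting a strategy by name.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical split that knows which backend host holds its data. */
final class HostedSplit {
    final String id;
    final String host;

    HostedSplit(String id, String host) {
        this.id = id;
        this.host = host;
    }
}

/** Picks (and removes) the next split for a requesting reader from a mutable pending list. */
interface AssignStrategy {
    Optional<HostedSplit> pick(List<HostedSplit> pending, String readerHost);
}

enum AssignStrategies implements AssignStrategy {
    /** Ignores locality: first pending split wins. */
    FIFO {
        @Override
        public Optional<HostedSplit> pick(List<HostedSplit> pending, String readerHost) {
            return pending.isEmpty() ? Optional.empty() : Optional.of(pending.remove(0));
        }
    },
    /** Prefers a split stored on the reader's host, falls back to FIFO otherwise. */
    LOCALITY_FIRST {
        @Override
        public Optional<HostedSplit> pick(List<HostedSplit> pending, String readerHost) {
            for (Iterator<HostedSplit> it = pending.iterator(); it.hasNext();) {
                HostedSplit split = it.next();
                if (split.host.equals(readerHost)) {
                    it.remove();
                    return Optional.of(split);
                }
            }
            return FIFO.pick(pending, readerHost);
        }
    };

    /** Resolves the strategy from a user-facing option value (hypothetical option). */
    static AssignStrategy fromConfig(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```

The enumerator would resolve the strategy once from the source configuration and call `pick` on each split request, so switching strategies requires no change to the source implementation.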

Best

Etienne



Other devs are welcome to share their opinions.

[1]: https://issues.apache.org/jira/browse/FLINK-31065





Also as for split assigner .


Best regards,
Yuxia

----- Original Message -----
From: "Etienne Chauchot" <echauc...@apache.org>
To: "dev" <dev@flink.apache.org>
Cc: "Chesnay Schepler" <ches...@apache.org>
Sent: Thursday, March 30, 2023, 10:36:39 PM
Subject: [blog article] Howto create a batch source with the new Source framework

Hi all,

After creating the Cassandra source connector (thanks Chesnay for the
review!), I wrote a blog article about how to create a batch source with
the new Source framework [1]. It gives field feedback on how to
implement the different components.

I felt it could be useful to people interested in contributing or
migrating connectors.

=> Can you give me your opinion?

=> I think it could be useful to post the article to the official Flink blog
as well, if you agree.

=> Same remark on my previous article [2]: what about publishing it to
the official Flink blog?


[1] https://echauchot.blogspot.com/2023/03/flink-howto-create-batch-source-with.html

[2] https://echauchot.blogspot.com/2022/11/flink-howto-migrate-real-life-batch.html


Best

Etienne
