Hi Weiwei, thanks for sharing your past experience! This is a helpful
discussion.

We should set up some dedicated discussions and topic threads for
"Streaming with Apache YuniKorn". I know a lot of folks from the industry
would be interested. This would be a great opportunity to expand YuniKorn's
footprint to more use cases.

In our next Apache Flink meetup, I could help to invite some speakers
(please feel free to recommend any) and organize a roundtable for
streaming-specific discussions so folks could share their experience/needs
to identify any gaps for future improvement together.

Please let me know what you think. +devs

Best,
Chenya



On Wed, Jan 5, 2022 at 9:52 AM Weiwei Yang <[email protected]> wrote:

> Hi Chenya
>
> > As we know, streaming applications are long-running and need to secure
> > all requested resources before starting to run. In most cases, they do not
> > have a strong need to be queued, ordered, or preempted to wait to obtain or
> > give back their resource.
>
> You are right if we assume pure streaming cases, where all jobs are
> long-running and the cluster has sufficient resources for all of them. Maybe
> it is fair to say it is not a day 1 challenge.
> However, in my past experience, this is not always enough and will not be
> enough. When we operated large-scale Flink jobs, the major issues we dealt
> with were resource utilization, resource contention, hot spots, isolation,
> etc. We used to have tens of queues per cluster, shared by many users. Jobs
> had different priorities, and high-priority jobs could make room by
> preempting lower-priority ones. We had a customized node-score system to
> distribute pods more efficiently. As you can see, resource queues,
> app-sorting, node-sorting, and preemption all play a role here. Central job
> management and scheduling latency/throughput are also important.
>
> K8s and the cloud bring more challenges. I guess one challenging and also
> interesting problem is how to do auto-scaling more efficiently. Sometimes
> we need a strategy to warm up resources on the cloud so that new jobs fit
> with low latency. Most likely the scheduler can give some hints for that.
> This will be a fun part to explore too. With all that being said, I do think
> a customized scheduler (instead of the pod-level scheduler,
> default-k8s-scheduler) will be necessary.
>
> On Tue, Jan 4, 2022 at 10:18 PM Chenya Zhang <[email protected]>
> wrote:
>
> > Hi Weiwei
> >
> > Thanks for sharing. I checked the video and for Alibaba's use case, they
> > have a mixed cluster for streaming and batch applications running with
> > Apache Flink. Our use case is different. We only use Apache Flink for
> > stream processing in physical clusters separate from Spark for batch
> > processing.
> >
> > As we know, streaming applications are long-running and need to secure
> > all requested resources before starting to run. In most cases, they do not
> > have a strong need to be queued, ordered, or preempted to wait to obtain or
> > give back their resource.
> >
> > I'm gathering more streaming use case requirements that could not be
> > satisfied by K8s namespace for resource quota management or other
> > advanced scheduling needs. Will keep this thread updated.
> >
> > Meanwhile, happy to hear more thoughts from you!
> >
> > Best,
> > Chenya
> >
> > On Tue, Jan 4, 2022 at 9:20 PM Weiwei Yang <[email protected]> wrote:
> >
> > > Hi Chenya
> > >
> > > The use case is similar, and YK will play a big role there. Lots of
> > > features are relevant, such as queues, job ordering, user/group ACLs,
> > > preemption, over-subscription, performance, etc.
> > > Some of the basic functionalities are available in YK; some more need
> > > to be built.
> > > Please take a look at the slides from the Alibaba Flink team; they have
> > > shared how they use YK to address their use cases.
> > > This was presented at ApacheCon:
> > > https://www.youtube.com/watch?v=4hghJCuZk5M
> > >
> > > On Tue, Jan 4, 2022 at 6:35 PM Chenya Zhang <[email protected]>
> > > wrote:
> > >
> > > > Hey folks,
> > > >
> > > > We have some new streaming use cases with Apache Flink that could
> > > > potentially leverage YuniKorn for resource scheduling.
> > > >
> > > > The initial implementation uses K8s namespaces for resource quota
> > > > management. We are investigating what strong benefits there could be
> > > > in switching to YuniKorn for long-running services in streaming
> > > > cases. For example, job queueing, job ordering, resource reservation,
> > > > user groups, etc. all seem to be more desirable for batch use cases.
> > > >
> > > > Any thoughts or suggestions?
> > > >
> > > > Thanks,
> > > > Chenya
> > > >
> > >
> >
>
