Re: [Discussion] Session Clusters Support Heterogeneous Task Manager Images

Ryan van Huuksloot Mon, 09 Dec 2024 18:51:15 -0800

Hello,

Sorry for the delay.


I agree; I think that works for most workflows. The only caveat would be
CUDA-based ML workflows: you can't bundle CUDA into a dependency bundle.
Overall, it works in application mode. It would just be awesome to use
Session clusters for Batch / ephemeral test streaming jobs.
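
For reference, the per-job dependency setup Dian describes below could be
sketched roughly as follows. This is a minimal sketch, not a tested
deployment: the option keys (python.files, python.requirements,
python.archives, python.executable) are PyFlink configuration options, but
the helper function, the hdfs:// paths, and the archive layout are
hypothetical and only there for illustration.

```python
from typing import Dict, Iterable, Optional


def pyflink_dependency_options(
    files: Iterable[str] = (),
    requirements: Optional[str] = None,
    archives: Iterable[str] = (),
    executable: Optional[str] = None,
) -> Dict[str, str]:
    """Build per-job PyFlink dependency options as a plain dict.

    Hypothetical helper: it only assembles option strings and does not
    talk to a cluster.
    """
    options: Dict[str, str] = {}
    files = list(files)
    archives = list(archives)
    if files:
        # Extra Python files/packages, comma-separated.
        options["python.files"] = ",".join(files)
    if requirements:
        # A requirements.txt listing third-party packages for this job.
        options["python.requirements"] = requirements
    if archives:
        # Archives extracted on the workers; "#name" sets the target dir.
        options["python.archives"] = ",".join(archives)
    if executable:
        # Interpreter used by the Python workers, e.g. one inside an archive.
        options["python.executable"] = executable
    return options


# Two jobs on the same session cluster with different dependencies.
# All paths below are made up for illustration.
ml_job = pyflink_dependency_options(
    files=["hdfs:///deps/ml/feature_lib.zip"],
    requirements="hdfs:///deps/ml/requirements.txt",
    archives=["hdfs:///deps/ml/venv.zip#venv"],
    executable="venv/bin/python",
)
etl_job = pyflink_dependency_options(
    requirements="hdfs:///deps/etl/requirements.txt",
)

for key, value in sorted(ml_job.items()):
    print(f"{key}: {value}")
```

Each job would pass its own dict into its job configuration at submission
time, so the image itself stays generic. This does not address the CUDA
caveat above, since system-level libraries still have to come from the image.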

Ryan van Huuksloot
Sr. Production Engineer | Streaming Platform


On Thu, Dec 5, 2024 at 2:56 AM Dian Fu <dian0511...@gmail.com> wrote:

> Hi Ryan,
>
> PyFlink supports configuring Python dependencies per job, so per my
> understanding, "dynamically provide dependencies in Python" should
> already be supported. Besides, it also supports specifying Python
> dependencies which are located in distributed file systems. A good
> approach would be to manage the Python dependencies in a distributed
> file system and let each job choose and configure which Python
> dependencies to use.
>
> Regards,
> Dian
>
> On Thu, Dec 5, 2024 at 3:28 PM Shengkai Fang <fskm...@gmail.com> wrote:
> >
> > Hi Ryan.
> >
> > Thanks for your input. I think it's better to load user Python
> > dependencies dynamically rather than use different images, because images
> > are not flexible and are hard to test:
> > * we need to build an image and push it to Docker Hub for testing...
> > * it takes a lot of time to build images...
> >
> > Best,
> > Shengkai
> >
> >
> > Ryan van Huuksloot <ryan.vanhuuksl...@shopify.com.invalid> wrote on Thu, Dec 5, 2024 at 12:46:
> >
> > > Hi Shengkai,
> > >
> > > re: (1)
> > > That is how we currently handle image management.
> > >
> > > re: (2)
> > > The current proposed use case is that MLEs provide different PyFlink jobs
> > > which can have different dependency/version requirements, and these
> > > packages can be quite large (GBs).
> > > In the Java world, you'd provide a different uber jar with the dependencies
> > > and that should work. In Python, as far as I know, you can't provide the
> > > same bundled dependencies.
> > > This means that we need to preload the image with all of the dependencies,
> > > but those dependencies would be static based on the pre-defined image, and
> > > different workloads on this session cluster may require different
> > > dependencies / versions.
> > >
> > > Maybe it is simpler to provide a way to dynamically provide dependencies in
> > > Python - similar to Java?
> > >
> > > (I haven't used the jar submission in Java)
> > >
> > > Thanks,
> > > Ryan van Huuksloot
> > > Sr. Production Engineer | Streaming Platform
> > >
> > >
> > > On Tue, Dec 3, 2024 at 9:11 PM Shengkai Fang <fskm...@gmail.com> wrote:
> > >
> > > > Hi Ryan.
> > > >
> > > > Thanks for your input. I am not a k8s expert, but I know that Flink k8s
> > > > deployments support starting Flink TaskManagers from a specified pod
> > > > template[1], which supports specifying the image. @Junrui may provide more
> > > > detailed information about this topic.
> > > >
> > > > If different taskmanagers have different workloads, it means the slots in
> > > > the different taskmanagers have different profiles. Otherwise, the scheduler
> > > > doesn't know the difference among the slots and may choose the wrong slot
> > > > to run the task. I am just curious what the difference is between the ETL
> > > > job and the ML job.
> > > >
> > > > Best,
> > > > Shengkai
> > > >
> > > > [1]
> > > > https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#pod-template
> > > >
> > > > Ryan van Huuksloot <ryan.vanhuuksl...@shopify.com.invalid> wrote on Tue, Dec 3, 2024 at 22:11:
> > > >
> > > > > Hi Shengkai,
> > > > >
> > > > > We currently use application mode. It is an option and may be the
> > > > > recommendation.
> > > > >
> > > > > Specifically for Batch jobs, we have Machine Learning pipelines that are
> > > > > ephemeral; however, they contain very different dependencies depending on
> > > > > the workload.
> > > > > From my perspective, Batch jobs work well on Session Clusters. However,
> > > > > due to the differing images, you cannot run different workloads on the
> > > > > same session cluster, which makes the session cluster essentially useless.
> > > > >
> > > > > Ryan van Huuksloot
> > > > > Sr. Production Engineer | Streaming Platform
> > > > >
> > > > >
> > > > >
> > > > > > On Tue, Dec 3, 2024 at 1:20 AM Shengkai Fang <fskm...@gmail.com> wrote:
> > > > >
> > > > > > Hi.
> > > > > >
> > > > > > > Why are different images needed for the taskmanagers? Do you mean
> > > > > > > different operators require different resources?
> > > > > > >
> > > > > > > As far as I know, the JM supports managing taskmanagers with different
> > > > > > > profiles. For example, a cluster may consist of two taskmanagers with
> > > > > > > the following profiles:
> > > > > > > * TM1 contains 4 slots, every slot has 2 cores, 4GB memory
> > > > > > > * TM2 contains 4 slots, every slot has 1 core, 2GB memory
> > > > > > >
> > > > > > > > the scheduler would need some level of job isolation
> > > > > > >
> > > > > > > You can use application mode to run the job. In application mode, the
> > > > > > > cluster is dedicated to the job.
> > > > > >
> > > > > > Best,
> > > > > > Shengkai
> > > > > >
> > > > >
> > > >
> > >
>
