Is the problem with the k8s scheduler?  Are you using Karpenter as well?
When this happens, the nodes are not scaling up at the same time you are
launching pods, right?

The problem of pod startup time is a common one; maybe we could take
something away from how gang scheduling works?  This is how Spark handles
it on Kubernetes: schedulers like YuniKorn reserve slots up front by
placing placeholder pods, so you are not wasting time on scheduling when
the real pods arrive.  I think the Overlord could potentially do something
similar to YuniKorn here.
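
To make that concrete, here is a rough, untested sketch of the kind of pod
spec the Overlord could submit so that YuniKorn reserves capacity with a
placeholder before the real task pod shows up.  The annotation keys, pod
names, and resource sizes are assumptions on my side (please check them
against the YuniKorn version you run); it uses the official Kubernetes
Python client:

from kubernetes import client, config

# Hypothetical sketch, not Druid code: launch one ingestion task pod with
# YuniKorn gang-scheduling annotations so capacity is reserved up front.
# Annotation keys/values are assumptions based on the YuniKorn docs;
# verify them against the YuniKorn version you actually run.

config.load_kube_config()  # use load_incluster_config() when running in-cluster

TASK_GROUP = "druid-ingest"  # hypothetical task-group name

annotations = {
    "yunikorn.apache.org/task-group-name": TASK_GROUP,
    # Ask YuniKorn to hold 2 CPU / 4Gi with a placeholder pod until the
    # real task pod arrives (or the timeout expires).
    "yunikorn.apache.org/task-groups": (
        '[{"name": "' + TASK_GROUP + '", "minMember": 1, '
        '"minResource": {"cpu": "2", "memory": "4Gi"}}]'
    ),
    "yunikorn.apache.org/schedulingPolicyParameters": (
        "placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard"
    ),
}

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="druid-ingest-task-0",  # hypothetical pod name
        labels={"applicationId": "druid-ingest-0", "queue": "root.default"},
        annotations=annotations,
    ),
    spec=client.V1PodSpec(
        scheduler_name="yunikorn",  # hand the pod to YuniKorn, not kube-scheduler
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="peon",
                image="apache/druid:latest",  # placeholder image tag
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "2", "memory": "4Gi"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="druid", body=pod)

The idea is that YuniKorn schedules a placeholder pod of the same size
right away, so when the real peon pod is created it lands on capacity that
is already reserved instead of waiting on the autoscaler.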

I think this is mostly an issue with k8s in general. For example, a
Spark / MapReduce job is much faster to spin up on YARN vs Kubernetes,
there is no denying that, but you lose the elasticity with YARN that you
have with k8s.



On Mon, Apr 28, 2025 at 9:18 AM Gian Merlino <g...@apache.org> wrote:

> I have replied in the issue with some thoughts on the root causes of
> ingestion lag, and pointers to some recent work on one of the root causes.
> (I believe there are two roots.)
>
> Gian
>
> On 2025/04/15 09:29:01 Frank Chen wrote:
> > Hi Gian and Maytas,
> >
> > I'm writing this email to bring an old ingestion issue to your
> > attention for discussion.
> >
> > It's about lag while tasks are rolling, here's the link:
> > https://github.com/apache/druid/issues/11414
> > From the issue, the root cause is that tasks take several seconds to
> > start up, during which messages can't be consumed from Kafka.
> > Gian linked some PRs in that issue which improved the performance of
> > notice processing, but this didn't solve the problem completely, and the
> > last reply in that thread suggested that on Druid 27 this problem still
> > exists.
> >
> > I also noticed that Maytas said in the HADOOP INGESTION SUPPORT thread
> > that he is going to use K8S-based ingestion to replace Middle Managers,
> > which makes sense to me because it improves resource utilization.
> > But the above lag issue might be magnified because K8S scheduling
> > introduces extra delay, for example resource allocation on the K8S side
> > and pulling the image from the repository, which can take seconds.
> > This means K8S-based ingestion tasks generally start up more slowly, so
> > I have included Maytas in the hope that this problem has already been
> > noticed or even solved by his team.
> >
> > If you have any suggestions/ideas, please reply to the original issue,
> > so that all the information is in one place.
> >
> > Thanks and regards.
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>
>
