Hi Daniel,

We will definitely try this out. We are still on 1.10.2, so we will do the
upgrade and see how it goes.
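
For reference, a rough sketch of the change we plan to make after upgrading,
assuming the new option lives under the [kubernetes] section of airflow.cfg
(the value below is illustrative, not a recommendation):

```
[kubernetes]
# Hypothetical value for illustration: how many worker pods the
# KubernetesExecutor is allowed to create per scheduler loop
# (the 1.10.3 default is reportedly 1).
worker_pods_creation_batch_size = 16
```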

Thanks
Kamil

On Tue, Apr 16, 2019 at 9:17 PM Daniel Imberman <
[email protected]> wrote:

> Hi Kamil,
>
> So if it's airflow that's the issue, then the PR I posted should have solved
> it. Have you upgraded to 1.10.3 and added the
> worker_pods_creation_batch_size variable to your airflow.cfg? This should
> allow multiple pods to be launched in parallel.
>
> Also unfortunately the screenshot appears to be broken. Could you please
> upload it as an imgur link?
>
> On Tue, Apr 16, 2019 at 10:50 AM Kamil Gałuszka <[email protected]>
> wrote:
>
> > Hi Daniel,
> >
> > It's airflow.
> >
> > This is a DAG that we can share. Of course, we can change this to
> > KubernetesPodOperator and it gets even worse (a sketch of that variant
> > follows the DAG below).
> >
> > ```
> > from airflow import DAG
> > from datetime import datetime, timedelta
> > from airflow.operators.dummy_operator import DummyOperator
> > from airflow.operators.bash_operator import BashOperator
> >
> > default_args = {
> >     'owner': 'airflow',
> >     'depends_on_past': False,
> >     'start_date': datetime(2019, 4, 10, 23),
> >     'email': ['[email protected]'],
> >     'email_on_failure': False,
> >     'email_on_retry': False,
> >     'retries': 5,
> >     'retry_delay': timedelta(minutes=2)
> > }
> >
> > SLEEP_DURATION = 300
> >
> > dag = DAG(
> >     'wide_bash_test_100_300',
> >     default_args=default_args,
> >     schedule_interval=timedelta(minutes=60),
> >     max_active_runs=1,
> >     concurrency=100)
> >
> > start = DummyOperator(task_id='dummy_start_point', dag=dag)
> > end = DummyOperator(task_id='dummy_end_point', dag=dag)
> >
> > for i in range(300):
> >
> >     sleeper_agent = BashOperator(task_id=f'sleep_{i}',
> >                                  bash_command=f'sleep {SLEEP_DURATION}',
> >                                  dag=dag)
> >
> >     start >> sleeper_agent >> end
> > ```
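> >
> > For completeness, a minimal sketch of the KubernetesPodOperator variant we
> > tried, replacing the BashOperator loop above (assuming the 1.10 contrib
> > import path; the image and namespace are placeholders, not our real values):
> >
> > ```
> > from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
> >
> > for i in range(300):
> >     # With the KubernetesExecutor, each task already gets its own worker pod;
> >     # KubernetesPodOperator then launches a second pod for the sleep itself.
> >     sleeper_agent = KubernetesPodOperator(task_id=f'sleep_{i}',
> >                                           name=f'sleep-{i}',
> >                                           namespace='default',
> >                                           image='ubuntu:18.04',
> >                                           cmds=['bash', '-cx'],
> >                                           arguments=[f'sleep {SLEEP_DURATION}'],
> >                                           get_logs=True,
> >                                           dag=dag)
> >
> >     start >> sleeper_agent >> end
> > ```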
> >
> > And here is some of what happens afterwards: some tasks fail, some are
> > retrying, and we definitely don't get 100 concurrency out of it. We have
> > node autoscaling in GKE, so every pod should eventually move from Pending
> > to Running. With concurrency set to 300 this gets a little worse.
> >
> > [image: Screen Shot 2019-04-16 at 10.30.18 AM (1).png]
> >
> > Thanks
> > Kamil
> >
> > On Tue, Apr 16, 2019 at 6:40 PM Daniel Imberman <
> > [email protected]> wrote:
> >
> >> Hi Kamil,
> >>
> >> Could you explain your use-case a little further? Is it that your k8s
> >> cluster runs into issues launching 250 tasks at the same time or that
> >> airflow runs into issues launching 250 tasks at the same time? I'd love
> >> to know more so I could try to address it in a future airflow release.
> >>
> >> Thanks!
> >>
> >> Daniel
> >>
> >> On Tue, Apr 16, 2019 at 3:32 AM Kamil Gałuszka <[email protected]>
> >> wrote:
> >>
> >> > Hey,
> >> >
> >> > > We are quite interested in that Executor too, but my main concern is:
> >> > > isn't it a waste of resources to start a whole pod to run things like
> >> > > DummyOperator, for example? We have a cap of 200 tasks at any given
> >> > > time and we regularly hit this cap; we cope with that with 20 celery
> >> > > workers, but with the KubernetesExecutor that would mean 200 pods.
> >> > > Does it really scale that easily?
> >> > >
> >> >
> >> > Unfortunately no.
> >> >
> >> > We now have a problem with a DAG of 300 tasks that should all start in
> >> > parallel at once, yet only about 140 task instances get started.
> >> > Setting parallelism to 256 didn't help, and the system struggles to get
> >> > the number of running tasks that high (a sketch of the relevant settings
> >> > is below).
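> >> >
> >> > A rough sketch of the knobs we have been tuning, assuming the 1.10.x
> >> > [core] option names (values are illustrative, not our exact config):
> >> >
> >> > ```
> >> > [core]
> >> > # Upper bound on task instances running across the whole installation.
> >> > parallelism = 256
> >> > # Upper bound on running task instances per DAG; the test DAG above also
> >> > # sets concurrency=100 explicitly, which takes precedence for that DAG.
> >> > dag_concurrency = 100
> >> > # Tasks not assigned to a pool draw from this slot count.
> >> > non_pooled_task_slot_count = 256
> >> > ```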
> >> >
> >> > The biggest problem we have now is finding the bottleneck in the
> >> > scheduler, but it's taking time to debug.
> >> >
> >> > We will definitely investigate this further and share our findings, but
> >> > for now I wouldn't say it's "non-problematic", as some other people have
> >> > stated.
> >> >
> >> > Thanks
> >> > Kamil
> >> >
> >>
> >
>
