Re: Airflow - YARN as an executor?

Ruslan Dautkhanov Wed, 25 Apr 2018 09:00:04 -0700

As long as that code is serializable (through pickle, cloudpickle, or any
other Python code serializer),
the answer should be yes.
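
To make the serializability condition concrete, here is a minimal sketch
using only the standard-library pickle module. It is not how any existing
Airflow executor is implemented; the names `is_picklable` and `task` are
hypothetical and chosen for illustration. It shows a callable surviving a
pickle round trip (as it would need to before being shipped to a remote
worker) and a lambda that plain pickle rejects.

```python
import functools
import operator
import pickle


def is_picklable(obj):
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Hypothetical stand-in for "the work an operator does": multiply by 2.
# functools.partial over a standard-library function pickles cleanly.
task = functools.partial(operator.mul, 2)

# Round trip: serialize on the "scheduler" side, restore on the "worker" side.
payload = pickle.dumps(task)
restored = pickle.loads(payload)
result = restored(21)

# A lambda has no importable name, so plain pickle cannot serialize it.
lambda_ok = is_picklable(lambda x: x * 2)

print(result, is_picklable(task), lambda_ok)
```

Plain pickle stores functions by reference (module plus name), which is why
the lambda fails. cloudpickle serializes such functions by value, so it
accepts a wider range of code.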


Thanks.



-- 
Ruslan Dautkhanov

On Wed, Apr 25, 2018 at 9:54 AM, Taylor Edmiston <[email protected]>
wrote:

> Is it possible for the (hypothetical) Airflow SparkExecutor to handle
> general execution of any operator (i.e., run non-Spark code)?
>
> *Taylor Edmiston*
> Blog <http://blog.tedmiston.com> | Stack Overflow CV
> <https://stackoverflow.com/story/taylor> | LinkedIn
> <https://www.linkedin.com/in/tedmiston/> | AngelList
> <https://angel.co/taylor>
>
>
> On Wed, Apr 25, 2018 at 11:22 AM, Ruslan Dautkhanov <[email protected]>
> wrote:
>
> > I used "Executor" as an Airflow term; I did not mean a Spark executor.
> > Spark would be one of the Airflow executors, like the ones here
> > https://github.com/apache/incubator-airflow/tree/master/airflow/executors
> > or here
> > https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/executors
> >
> > Thanks.
> >
> >
> >
> > --
> > Ruslan Dautkhanov
> >
> > On Wed, Apr 25, 2018 at 9:17 AM, Bolke de Bruin <[email protected]>
> wrote:
> >
> > > I'm a bit lost on the Spark executor, to be honest. To my knowledge the
> > > Spark driver creates Spark executors, which run Spark code. In other
> > > words, it can't arbitrarily run generic code. Or can it?
> > >
> > > B.
> > >
> > > Sent from my iPad
> > > >
> > > > > On 25 Apr 2018, at 17:11, Ruslan Dautkhanov <[email protected]> wrote:
> > > >
> > > > > Now I wonder whether a PySpark Executor for Airflow would be an easier target.
> > > > > Spark runs on YARN, Mesos, and now Kubernetes,
> > > > > so a PySpark Executor would let Airflow run on all of these schedulers.
> > > > > It's my understanding that we currently have only a Spark Operator, not an Executor.
> > > >
> > > > Thanks!
> > > >
> > > >
> > > >
> > > > --
> > > > Ruslan Dautkhanov
> > > >
> > > >> On Tue, Apr 24, 2018 at 3:20 PM, Ace Haidrey <[email protected]>
> > > wrote:
> > > >>
> > > > >> Hey, I didn't know this, Bolke. I was under the same impression as
> > > > >> Ruslan.
> > > > >> Thanks for sharing.
> > > >>
> > > >> Sent from my iPhone
> > > >>
> > > >>> On Apr 24, 2018, at 2:12 PM, Bolke de Bruin <[email protected]>
> > wrote:
> > > >>>
> > > > >>> It actually can nowadays:
> > > > >>> https://cdn.oreillystatic.com/en/assets/1/event/269/HDFS%20on%20Kubernetes_%20Tech%20deep%20dive%20on%20locality%20and%20security%20Presentation.pptx
> > > >>>
> > > > >>> We also have an on-premise setup with Ceph (s3a) and HDFS for when we
> > > > >>> need the speed, and Kubernetes for our workloads. We are kicking out
> > > > >>> YARN (and Hive etc., for that matter).
> > > >>>
> > > >>> Bolke
> > > >>>
> > > >>>
> > > >>>
> > > > >>> Sent from my iPad
> > > > >>>
> > > > >>>> On 24 Apr 2018, at 22:50, Ruslan Dautkhanov <[email protected]> wrote:
> > > >>>>
> > > > >>>> Kubernetes is a "monolithic" one-level scheduler that can't handle what
> > > > >>>> YARN can, for example scheduling tasks local to the data.
> > > > >>>> Hadoop has multiple levels of data locality (node-local, rack-local),
> > > > >>>> so computation happens local to the data to minimize network
> > > > >>>> data transfer, which is expensive.
> > > > >>>> K8s wasn't designed to handle these scheduling scenarios, as far as I
> > > > >>>> know.
> > > >>>>
> > > > >>>> For cloud deployments where we don't have a data locality problem
> > > > >>>> (because S3 is used instead of storage local
> > > > >>>> to the servers), k8s might be okay.
> > > >>>>
> > > > >>>> There is a nice comparison [1] of k8s vs. two-level schedulers like YARN and
> > > > >>>> Mesos, although I think it's off-topic.
> > > >>>>
> > > > >>>> We're mostly on-prem, and we don't see Kubernetes taking over YARN any
> > > > >>>> time soon.
> > > >>>>
> > > >>>> Thanks.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> [1]
> > > >>>>
> > > > >>>> https://aaltodoc.aalto.fi/bitstream/handle/123456789/27061/master_Ravula_Shashi_2017.pdf?sequence=1
> > > >>>>
> > > >>>> *2.3.2 Monolithic Schedulers *
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> Monolithic schedulers use a single, centralized scheduling
> algorithm
> > > for
> > > >>>> all jobs. All workload is run through the same scheduler and same
> > > >>>> scheduling logic. Swarm,
> > > >>>> Fleet, Borg and Kubernetes adopt monolithic schedulers. Kubernetes
> > > >>>> improvised on basic monolithic version of Borg and Swarm
> schedulers.
> > > >> This
> > > >>>> type of schedulers are not suitable for running heterogeneous
> modern
> > > >>>> workloads which include Spark jobs, containers, and other long
> > running
> > > >> jobs,
> > > >>>> etc.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> *2.3.3 Two Level Schedulers *
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> Two-level schedulers address the drawbacks of a monolithic
> scheduler
> > > by
> > > >>>> separating concerns of resource allocation and task placement. An
> > > active
> > > >>>> resource manager offers compute resources to multiple parallel,
> > > >> independent
> > > >>>> “scheduler frameworks”. The Mesos cluster manager pioneered this
> > > >> approach,
> > > >>>> and YARN supports a limited version of it. In Mesos, resources are
> > > >> offered
> > > >>>> to application-level schedulers. This allows for custom,
> > > >> workload-specific
> > > >>>> scheduling policies. The drawback with this type of scheduling
> > > >> architecture
> > > >>>> is that the application level frameworks cannot see all the
> possible
> > > >>>> placement options anymore. Instead, they only see those options
> that
> > > >>>> correspond to resources offered (Mesos) or allocated (YARN) by the
> > > >> resource
> > > >>>> manager component. This makes priority preemption (higher priority
> > > tasks
> > > >>>> kick out lower priority ones) difficult.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> --
> > > >>>> Ruslan Dautkhanov
> > > >>>>
> > > >>>>> On Tue, Apr 24, 2018 at 2:22 PM, Bolke de Bruin <
> [email protected]
> > >
> > > >> wrote:
> > > >>>>>
> > > >>>>> Happy to have it as a contrib executor. However, I personally
> think
> > > >> yarn
> > > >>>>> is a dead end. It has a lot of catching up to do and all the
> > momentum
> > > >> is
> > > >>>>> with kubernetes.
> > > >>>>>
> > > >>>>> B.
> > > >>>>>
> > > > >>>>> Sent from my iPad
> > > > >>>>>
> > > > >>>>>> On 24 Apr 2018, at 22:13, Ruslan Dautkhanov <[email protected]> wrote:
> > > >>>>>>
> > > > >>>>>> With Hadoop 3's Docker on YARN support, I think YARN becomes
> > > > >>>>>> something of a competitor to Kubernetes.
> > > >>>>>>
> > > >>>>>> Great job on adding k8s support to Airflow.
> > > >>>>>>
> > > > >>>>>> Very similarly, I see that Airflow could integrate with YARN and use
> > > > >>>>>> its infrastructure as an "executor". Has anyone explored the
> > > > >>>>>> feasibility of this approach?
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Thanks!
> > > >>>>>> Ruslan Dautkhanov
> > > >>>>>
> > > >>
> > >
> >
>
