Is it possible for the (hypothetical) Airflow SparkExecutor to handle general execution of any operator (i.e., run non-Spark code)?
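
To make the question concrete, here is a minimal sketch of what I have in mind. It assumes PySpark, where a Spark task is just a Python function shipped to the workers, so that function could in principle shell out to any command. The names `run_task_command` and `execute_on_spark` are hypothetical and are not part of Airflow or Spark; `sc` is assumed to be an existing `pyspark.SparkContext`.

```python
import subprocess
import sys


def run_task_command(command):
    """Run one task command (a list of arguments) and return its exit code.

    Nothing in here is Spark-specific: it is a plain Python function.
    """
    return subprocess.call(command)


def execute_on_spark(sc, command):
    """Hypothetical helper: run `command` inside a single Spark task.

    Assumes `sc` is a pyspark.SparkContext. The command is wrapped in a
    one-element RDD with one partition, so exactly one task runs it.
    """
    return sc.parallelize([command], 1).map(run_task_command).collect()[0]


# Local check without a cluster: the same function runs generic, non-Spark code.
exit_code = run_task_command(
    [sys.executable, "-c", "print('hello from a non-Spark task')"]
)
print(exit_code)
```

Whether this is a sensible basis for an executor (retries, logging, resource requests) is a separate question; the sketch only shows that the code run by a Spark task does not itself have to be Spark code.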

*Taylor Edmiston*
Blog <http://blog.tedmiston.com> | Stack Overflow CV <https://stackoverflow.com/story/taylor> | LinkedIn <https://www.linkedin.com/in/tedmiston/> | AngelList <https://angel.co/taylor>

On Wed, Apr 25, 2018 at 11:22 AM, Ruslan Dautkhanov <[email protected]> wrote:

> I used "Executor" as an Airflow term; I didn't mean a Spark executor ...
> Like Spark would be one of the Executors
> in here
> https://github.com/apache/incubator-airflow/tree/master/airflow/executors
> or in here
> https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/executors
>
> Thanks.
>
> --
> Ruslan Dautkhanov
>
> On Wed, Apr 25, 2018 at 9:17 AM, Bolke de Bruin <[email protected]> wrote:
>
> > I'm a bit lost on the Spark executor, to be honest. To my knowledge the
> > Spark driver creates Spark executors which run Spark code. In other words,
> > it can't arbitrarily run generic code. Or can it?
> >
> > B.
> >
> > Sent from my iPad
> >
> > > On 25 Apr 2018, at 17:11, Ruslan Dautkhanov <[email protected]> wrote:
> > >
> > > Now I wonder if Airflow on a PySpark Executor would be an easier target.
> > > Spark runs on YARN, Mesos and now Kubernetes.
> > > So a PySpark Executor would effectively port Airflow to these schedulers.
> > > It's my understanding we now have only a Spark Operator and not an Executor.
> > >
> > > Thanks!
> > >
> > > --
> > > Ruslan Dautkhanov
> > >
> > >> On Tue, Apr 24, 2018 at 3:20 PM, Ace Haidrey <[email protected]> wrote:
> > >>
> > >> Hey, I didn't know this, Bolke; I was under the same impression as Ruslan.
> > >> Thanks for the share
> > >>
> > >> Sent from my iPhone
> > >>
> > >>> On Apr 24, 2018, at 2:12 PM, Bolke de Bruin <[email protected]> wrote:
> > >>>
> > >>> It actually can nowadays:
> > >>> https://cdn.oreillystatic.com/en/assets/1/event/269/HDFS%20on%20Kubernetes_%20Tech%20deep%20dive%20on%20locality%20and%20security%20Presentation.pptx
> > >>>
> > >>> We also have an on-premise setup with Ceph (s3a) and HDFS for when we
> > >>> need the speed, and Kubernetes for our workloads. We are kicking out YARN
> > >>> (and Hive etc. for that matter).
> > >>>
> > >>> Bolke
> > >>>
> > >>> Sent from my iPad
> > >>>
> > >>>> On 24 Apr 2018, at 22:50, Ruslan Dautkhanov <[email protected]> wrote:
> > >>>>
> > >>>> Kubernetes is a "monolithic" 1-level scheduler that can't handle what YARN
> > >>>> can - for example, scheduling tasks local to data.
> > >>>> Hadoop has multiple levels of data locality (node-local, rack-local), so
> > >>>> computation happens local to the data to minimize network
> > >>>> data transfer, which is expensive.
> > >>>> K8s wasn't designed to handle these scheduling scenarios, as far as I know.
> > >>>>
> > >>>> For cloud deployments where we don't have a data locality problem (because
> > >>>> S3 is being used instead of storage local
> > >>>> to servers), k8s might be okay.
> > >>>>
> > >>>> Nice comparison [1] of k8s vs two-level schedulers like YARN and Mesos ..
> > >>>> although I think it's off-topic.
> > >>>>
> > >>>> We're mostly on-prem and we don't see Kubernetes taking over YARN any time
> > >>>> soon.
> > >>>>
> > >>>> Thanks.
> > >>>>
> > >>>> [1]
> > >>>> https://aaltodoc.aalto.fi/bitstream/handle/123456789/27061/master_Ravula_Shashi_2017.pdf?sequence=1
> > >>>>
> > >>>> *2.3.2 Monolithic Schedulers*
> > >>>>
> > >>>> Monolithic schedulers use a single, centralized scheduling algorithm for
> > >>>> all jobs. All workload is run through the same scheduler and same
> > >>>> scheduling logic. Swarm, Fleet, Borg and Kubernetes adopt monolithic
> > >>>> schedulers. Kubernetes improvised on basic monolithic version of Borg and
> > >>>> Swarm schedulers. This type of schedulers are not suitable for running
> > >>>> heterogeneous modern workloads which include Spark jobs, containers, and
> > >>>> other long running jobs, etc.
> > >>>>
> > >>>> *2.3.3 Two Level Schedulers*
> > >>>>
> > >>>> Two-level schedulers address the drawbacks of a monolithic scheduler by
> > >>>> separating concerns of resource allocation and task placement. An active
> > >>>> resource manager offers compute resources to multiple parallel, independent
> > >>>> “scheduler frameworks”. The Mesos cluster manager pioneered this approach,
> > >>>> and YARN supports a limited version of it. In Mesos, resources are offered
> > >>>> to application-level schedulers. This allows for custom, workload-specific
> > >>>> scheduling policies. The drawback with this type of scheduling architecture
> > >>>> is that the application level frameworks cannot see all the possible
> > >>>> placement options anymore. Instead, they only see those options that
> > >>>> correspond to resources offered (Mesos) or allocated (YARN) by the resource
> > >>>> manager component. This makes priority preemption (higher priority tasks
> > >>>> kick out lower priority ones) difficult.
> > >>>>
> > >>>> --
> > >>>> Ruslan Dautkhanov
> > >>>>
> > >>>>> On Tue, Apr 24, 2018 at 2:22 PM, Bolke de Bruin <[email protected]> wrote:
> > >>>>>
> > >>>>> Happy to have it as a contrib executor. However, I personally think YARN
> > >>>>> is a dead end. It has a lot of catching up to do and all the momentum is
> > >>>>> with Kubernetes.
> > >>>>>
> > >>>>> B.
> > >>>>>
> > >>>>> Sent from my iPad
> > >>>>>
> > >>>>>> On 24 Apr 2018, at 22:13, Ruslan Dautkhanov <[email protected]> wrote:
> > >>>>>>
> > >>>>>> With Hadoop 3's Docker on YARN support, I think YARN becomes
> > >>>>>> somewhat of a competitor to Kubernetes.
> > >>>>>>
> > >>>>>> Great job on adding k8s support to Airflow.
> > >>>>>>
> > >>>>>> Very similarly, I see Airflow could integrate with YARN and use
> > >>>>>> its infrastructure as an "executor" .. has anyone explored the
> > >>>>>> feasibility of this approach?
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>> Ruslan Dautkhanov
