I used "Executor" as an Airflow term; I did not mean a Spark executor. Spark would be one of the Executors, like the ones here:
https://github.com/apache/incubator-airflow/tree/master/airflow/executors
or here:
https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/executors
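
To make the distinction concrete, here is a rough, purely illustrative sketch of an "executor" in the Airflow sense: a pluggable backend that accepts task commands and decides where they run. It is a toy and does not use Airflow's real base class or API; `ToyExecutor`, `queue_command`, `heartbeat` and `LocalSubprocessExecutor` are hypothetical names invented for this example. A Spark- or YARN-backed variant would presumably override `execute` to submit the command to the cluster instead of running it locally, but how that would work is exactly the open question in this thread.

```python
import subprocess
import sys


class ToyExecutor:
    """Hypothetical stand-in for an executor interface (NOT Airflow's API).

    It queues task commands and, on each heartbeat, hands them to a backend.
    """

    def __init__(self):
        self.queued = []   # list of (key, command) waiting to run
        self.state = {}    # key -> "success" or "failed"

    def queue_command(self, key, command):
        """Register a command (a list of argv strings) under a task key."""
        self.queued.append((key, command))

    def execute(self, key, command):
        """Run one command on the backend; subclasses decide where and how."""
        raise NotImplementedError

    def heartbeat(self):
        """Drain the queue, recording the final state of every command."""
        while self.queued:
            key, command = self.queued.pop(0)
            self.state[key] = self.execute(key, command)


class LocalSubprocessExecutor(ToyExecutor):
    """Backend that runs each command as a local subprocess."""

    def execute(self, key, command):
        result = subprocess.run(command)
        return "success" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    executor = LocalSubprocessExecutor()
    executor.queue_command("task_ok", [sys.executable, "-c", "pass"])
    executor.queue_command("task_bad", [sys.executable, "-c", "raise SystemExit(1)"])
    executor.heartbeat()
    print(executor.state)
```

The point of the sketch is only that the scheduler side talks to one small interface, so the backend (local processes, Celery, Kubernetes, or possibly Spark/YARN) can be swapped without touching the DAG definitions.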
Thanks.

--
Ruslan Dautkhanov

On Wed, Apr 25, 2018 at 9:17 AM, Bolke de Bruin <[email protected]> wrote:

> I'm a bit lost on the Spark executor, to be honest. To my knowledge the
> Spark driver creates Spark executors, which run Spark code. In other
> words, it can't arbitrarily run generic code. Or can it?
>
> B.
>
> Sent from my iPad
>
> > On 25 Apr 2018, at 17:11, Ruslan Dautkhanov <[email protected]> wrote:
> >
> > Now I think Airflow on a PySpark Executor would be an easier target.
> > Spark runs on YARN, Mesos and now Kubernetes.
> > So a PySpark Executor would port Airflow to these schedulers.
> > It's my understanding that we now have only a Spark Operator and not
> > an Executor.
> >
> > Thanks!
> >
> > --
> > Ruslan Dautkhanov
> >
> >> On Tue, Apr 24, 2018 at 3:20 PM, Ace Haidrey <[email protected]> wrote:
> >>
> >> Hey, I didn't know this, Bolke. I was under the same impression as
> >> Ruslan.
> >> Thanks for the share.
> >>
> >> Sent from my iPhone
> >>
> >>> On Apr 24, 2018, at 2:12 PM, Bolke de Bruin <[email protected]> wrote:
> >>>
> >>> It actually can nowadays:
> >>> https://cdn.oreillystatic.com/en/assets/1/event/269/HDFS%20on%20Kubernetes_%20Tech%20deep%20dive%20on%20locality%20and%20security%20Presentation.pptx
> >>>
> >>> We also have an on-premise setup with Ceph (s3a) and HDFS for when we
> >>> need the speed, and Kubernetes for our workloads. We are kicking out
> >>> YARN (and Hive etc. for that matter).
> >>>
> >>> Bolke
> >>>
> >>> Sent from my iPad
> >>>
> >>>> On 24 Apr 2018, at 22:50, Ruslan Dautkhanov <[email protected]> wrote:
> >>>>
> >>>> Kubernetes is a "monolithic" 1-level scheduler that can't handle what
> >>>> YARN can - for example, scheduling tasks local to data.
> >>>> Hadoop has multiple levels of data locality (node-local, rack-local),
> >>>> so computation happens local to the data to minimize network data
> >>>> transfer, which is expensive.
> >>>> K8s wasn't designed to handle these scheduling scenarios, as far as I
> >>>> know.
> >>>>
> >>>> For cloud deployments where we don't have a data locality problem
> >>>> (because S3 is being used instead of storage local to servers), k8s
> >>>> might be okay.
> >>>>
> >>>> Nice comparison [1] of k8s vs. two-level schedulers like YARN and
> >>>> Mesos .. although I think it's off-topic.
> >>>>
> >>>> We're mostly on-prem and we don't see Kubernetes taking over YARN any
> >>>> time soon.
> >>>>
> >>>> Thanks.
> >>>>
> >>>> [1]
> >>>> https://aaltodoc.aalto.fi/bitstream/handle/123456789/27061/master_Ravula_Shashi_2017.pdf?sequence=1
> >>>>
> >>>> *2.3.2 Monolithic Schedulers*
> >>>>
> >>>> Monolithic schedulers use a single, centralized scheduling algorithm
> >>>> for all jobs. All workload is run through the same scheduler and same
> >>>> scheduling logic. Swarm, Fleet, Borg and Kubernetes adopt monolithic
> >>>> schedulers. Kubernetes improvised on basic monolithic version of Borg
> >>>> and Swarm schedulers. This type of schedulers are not suitable for
> >>>> running heterogeneous modern workloads which include Spark jobs,
> >>>> containers, and other long running jobs, etc.
> >>>>
> >>>> *2.3.3 Two Level Schedulers*
> >>>>
> >>>> Two-level schedulers address the drawbacks of a monolithic scheduler
> >>>> by separating concerns of resource allocation and task placement. An
> >>>> active resource manager offers compute resources to multiple
> >>>> parallel, independent "scheduler frameworks". The Mesos cluster
> >>>> manager pioneered this approach, and YARN supports a limited version
> >>>> of it. In Mesos, resources are offered to application-level
> >>>> schedulers. This allows for custom, workload-specific scheduling
> >>>> policies. The drawback with this type of scheduling architecture is
> >>>> that the application level frameworks cannot see all the possible
> >>>> placement options anymore. Instead, they only see those options that
> >>>> correspond to resources offered (Mesos) or allocated (YARN) by the
> >>>> resource manager component. This makes priority preemption (higher
> >>>> priority tasks kick out lower priority ones) difficult.
> >>>>
> >>>> --
> >>>> Ruslan Dautkhanov
> >>>>
> >>>>> On Tue, Apr 24, 2018 at 2:22 PM, Bolke de Bruin <[email protected]> wrote:
> >>>>>
> >>>>> Happy to have it as a contrib executor. However, I personally think
> >>>>> YARN is a dead end. It has a lot of catching up to do and all the
> >>>>> momentum is with Kubernetes.
> >>>>>
> >>>>> B.
> >>>>>
> >>>>> Sent from my iPad
> >>>>>
> >>>>>> On 24 Apr 2018, at 22:13, Ruslan Dautkhanov <[email protected]> wrote:
> >>>>>>
> >>>>>> With Hadoop 3's Docker on YARN support, I think YARN becomes
> >>>>>> somewhat of a competitor to Kubernetes.
> >>>>>>
> >>>>>> Great job on adding k8s support to Airflow.
> >>>>>>
> >>>>>> Very similarly, I see Airflow could integrate with YARN and use
> >>>>>> its infrastructure as an "executor" .. has anyone explored the
> >>>>>> feasibility of this approach?
> >>>>>>
> >>>>>> Thanks!
> >>>>>> Ruslan Dautkhanov
