Oops, + Daniel

On Mon, Sep 12, 2016 at 3:43 PM, Rob Froetscher <[email protected]> wrote:

> Hey Daniel,
>
> We also run Airflow on Docker and use EMR.
>
> I wrote a PR <https://github.com/apache/incubator-airflow/pull/1630> to
> add EMR support to Airflow. It has been merged but has not been released
> yet. The idea is that you keep your configuration as connections in the
> DB, use the operators to interact with your cluster, and use the sensors
> to wait for any action to complete.
>
> There are two good example DAGs in the PR:
>
> https://github.com/apache/incubator-airflow/pull/1630/files
>
> We are currently using this for several jobs. Happy to answer any
> questions you have about how we use it.
>
> Best,
>
> Rob
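For a sense of what this looks like in practice, here is a minimal sketch
along the lines of the example DAGs in that PR, using the
EmrAddStepsOperator and EmrStepSensor it introduces. The job flow id, step
definition, and S3 paths are placeholders (not from this thread), and AWS
credentials are assumed to live in an 'aws_default' Airflow connection:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
    from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

    # Hypothetical Spark step; the S3 path and arguments are placeholders.
    # command-runner.jar is the standard way to run spark-submit as an EMR step.
    SPARK_STEPS = [{
        'Name': 'spark_submit_step',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--deploy-mode', 'cluster',
                     's3://my-bucket/jobs/my_job.py'],
        },
    }]

    dag = DAG('emr_existing_cluster_example',
              start_date=datetime(2016, 9, 1),
              schedule_interval=None)

    # Submit the step to an already-running cluster; the job flow id below
    # is a placeholder for a real cluster id.
    add_step = EmrAddStepsOperator(
        task_id='add_step',
        job_flow_id='j-XXXXXXXXXXXXX',
        aws_conn_id='aws_default',
        steps=SPARK_STEPS,
        dag=dag)

    # Poll EMR until the submitted step completes (the task fails if the
    # step fails). The operator returns the new step ids, pulled via XCom.
    watch_step = EmrStepSensor(
        task_id='watch_step',
        job_flow_id='j-XXXXXXXXXXXXX',
        step_id="{{ task_instance.xcom_pull('add_step', key='return_value')[0] }}",
        aws_conn_id='aws_default',
        dag=dag)

    add_step.set_downstream(watch_step)

Because the sensor pulls the step id that the operator pushed via XCom, the
DAG only moves on once EMR itself reports the step finished.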
> On Mon, Sep 12, 2016 at 2:10 PM, Daniel Siegmann
> <[email protected]> wrote:
>
>> Does anyone have experience using Airflow to launch Spark jobs on an
>> Amazon EMR cluster?
>>
>> I have an Airflow cluster - separate from my EMR cluster - built as
>> Docker containers. I want Airflow to submit jobs to an existing EMR
>> cluster (though in the future I want Airflow to start and stop
>> clusters).
>>
>> I could copy the Hadoop configs from EMR to each of the Airflow nodes,
>> but that's a pain. It will be even more of a pain when I want Airflow
>> to create and destroy clusters, so I'd rather not take this approach.
>>
>> The only alternative I can think of is to use SSH to execute the
>> spark-submit command on the EMR master node. This is simple enough,
>> except Airflow will need the identity file for SSH access. Just copying
>> the identity file to the Airflow nodes is problematic because Airflow
>> runs in Docker and I don't want this file in my Git repo.
>>
>> Is there anyone with a similar setup who would care to share their
>> solution?
>>
>> --
>> Daniel Siegmann
>> Senior Software Engineer
>> *SecurityScorecard Inc.*
>> 214 W 29th Street, 5th Floor
>> New York, NY 10001
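The operators above sidestep SSH entirely, but for the SSH route Daniel
describes, a rough sketch with a plain BashOperator follows. The master
hostname, key path, and S3 path are placeholders; the sketch assumes the
identity file is mounted into the container at runtime (a Docker volume or
secret) rather than baked into the image or committed to Git. 'hadoop' is
the default login user on EMR master nodes:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG('emr_ssh_spark_submit',
              start_date=datetime(2016, 9, 1),
              schedule_interval=None)

    # The key at /run/secrets/emr_key is mounted into the container at
    # runtime, so it never lives in the image or the repo. Hostname and
    # job location are placeholders.
    spark_submit = BashOperator(
        task_id='spark_submit_via_ssh',
        bash_command=(
            'ssh -i /run/secrets/emr_key -o StrictHostKeyChecking=no '
            '[email protected] '
            "'spark-submit --deploy-mode cluster s3://my-bucket/jobs/my_job.py'"
        ),
        dag=dag)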
