Oops, + Daniel

On Mon, Sep 12, 2016 at 3:43 PM, Rob Froetscher <[email protected]>
wrote:

> Hey Daniel,
>
> We also run Airflow on Docker and use EMR.
>
> I wrote a PR <https://github.com/apache/incubator-airflow/pull/1630> to
> add EMR support to Airflow. It has been merged but not yet released. The
> idea is that you store your cluster config as connections in the DB, then
> use the operators to interact with your cluster and the sensors to wait
> for an action to complete.
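>
> Roughly, adding a Spark step to a running cluster and waiting for it looks
> something like this (just a sketch; the step definition, job flow id, and
> S3 path below are made-up placeholders, see the example DAGs for real
> usage):
>
>     from datetime import datetime
>     from airflow import DAG
>     from airflow.contrib.operators.emr_add_steps_operator import \
>         EmrAddStepsOperator
>     from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
>
>     # Illustrative Spark step; command-runner.jar is how EMR invokes
>     # spark-submit on the cluster.
>     SPARK_STEP = [{
>         'Name': 'my_spark_job',
>         'ActionOnFailure': 'CONTINUE',
>         'HadoopJarStep': {
>             'Jar': 'command-runner.jar',
>             'Args': ['spark-submit', '--deploy-mode', 'cluster',
>                      's3://my-bucket/my_job.py'],  # placeholder path
>         },
>     }]
>
>     dag = DAG('emr_spark_example', start_date=datetime(2016, 9, 1),
>               schedule_interval='@daily')
>
>     # Add the step to an existing cluster; AWS credentials come from the
>     # 'aws_default' connection stored in the Airflow DB.
>     add_step = EmrAddStepsOperator(
>         task_id='add_step',
>         job_flow_id='j-XXXXXXXXXXXX',  # placeholder cluster id
>         aws_conn_id='aws_default',
>         steps=SPARK_STEP,
>         dag=dag)
>
>     # Block until the step completes; the step id is pulled from the
>     # XCom pushed by the task above.
>     watch_step = EmrStepSensor(
>         task_id='watch_step',
>         job_flow_id='j-XXXXXXXXXXXX',
>         step_id="{{ task_instance.xcom_pull('add_step', "
>                 "key='return_value')[0] }}",
>         aws_conn_id='aws_default',
>         dag=dag)
>
>     watch_step.set_upstream(add_step)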
>
> There are two good example DAGs in the PR.
>
> https://github.com/apache/incubator-airflow/pull/1630/files
>
> We are currently using this for several jobs. Happy to answer any
> questions about how we use it.
>
> Best,
>
> Rob
>
> On Mon, Sep 12, 2016 at 2:10 PM, Daniel Siegmann <
> [email protected]> wrote:
>
>> Does anyone have experience using Airflow to launch Spark jobs on an
>> Amazon
>> EMR cluster?
>>
>> I have an Airflow cluster, separate from my EMR cluster, built as Docker
>> containers. I want Airflow to submit jobs to an existing EMR cluster
>> (though in the future I want Airflow to start and stop clusters as well).
>>
>> I could copy the Hadoop configs from EMR to each of the Airflow nodes, but
>> that's a pain. It'll be even more of a pain when I want to have Airflow
>> create and destroy clusters. So I'd rather not take this approach.
>>
>> The only alternative I can think of is to use SSH to run the
>> spark-submit command on the EMR master node. This is simple enough, except
>> Airflow will need the identity file to authenticate over SSH. Just copying
>> the identity file to the Airflow nodes is problematic because Airflow runs
>> in Docker and I don't want this file in my Git repo.
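>>
>> Something like the following is what I have in mind (a sketch; the key
>> path, master DNS, and S3 path are placeholders, and the key would have to
>> be mounted into the containers, e.g. as a Docker volume, rather than
>> checked into Git):
>>
>>     from datetime import datetime
>>     from airflow import DAG
>>     from airflow.operators.bash_operator import BashOperator
>>
>>     EMR_MASTER = 'ec2-XX-XX-XX-XX.compute-1.amazonaws.com'  # placeholder
>>
>>     dag = DAG('ssh_spark_submit', start_date=datetime(2016, 9, 1),
>>               schedule_interval='@daily')
>>
>>     # Run spark-submit on the EMR master over SSH; 'hadoop' is the
>>     # default login user on EMR nodes. The identity file is expected at
>>     # a path mounted into the container, not baked into the image.
>>     submit = BashOperator(
>>         task_id='spark_submit',
>>         bash_command=(
>>             'ssh -i /keys/emr.pem -o StrictHostKeyChecking=no '
>>             'hadoop@' + EMR_MASTER + ' '
>>             '"spark-submit --master yarn --deploy-mode cluster '
>>             's3://my-bucket/my_job.py"'
>>         ),
>>         dag=dag)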
>>
>> Is there anyone with a similar setup that would care to share their
>> solution?
>>
>> --
>> Daniel Siegmann
>> Senior Software Engineer
>> *SecurityScorecard Inc.*
>> 214 W 29th Street, 5th Floor
>> New York, NY 10001
>>
>
>
