Does anyone have experience using Airflow to launch Spark jobs on an Amazon
EMR cluster?

I have an Airflow cluster - separate from my EMR cluster - built as Docker
containers. I want Airflow to submit jobs to an existing EMR cluster
(though in the future I'd like Airflow to start and stop clusters as well).

I could copy the Hadoop configs from the EMR master to each of the Airflow
nodes so spark-submit could talk to YARN directly, but that's a pain. It'll
be even more of a pain once Airflow is creating and destroying clusters,
since the configs would change with every new cluster. So I'd rather not
take this approach.
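
For concreteness, the config-copy version would be something like the
following (just a sketch - the config directory, bucket, and script name
are made up, and I'm assuming a recent Airflow with the Bash operator):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Sketch of the "copy the Hadoop configs" approach: spark-submit runs
    # on the Airflow worker itself and talks to the cluster's YARN, which
    # only works if the EMR config files were copied onto the worker first.
    with DAG(
        dag_id="emr_spark_local_submit",
        start_date=datetime(2021, 1, 1),
        schedule=None,
        catchup=False,
    ) as dag:
        submit = BashOperator(
            task_id="spark_submit_yarn",
            # Point Spark at the configs copied from the EMR master.
            bash_command=(
                "export HADOOP_CONF_DIR=/opt/emr-conf YARN_CONF_DIR=/opt/emr-conf && "
                "spark-submit --master yarn --deploy-mode cluster "
                "s3://my-bucket/jobs/my_job.py"
            ),
        )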

The only alternative I can think of is to use SSH to execute the
spark-submit command on the EMR master node. This is simple enough, except
Airflow will need the identity file for SSH access. Simply copying the
identity file onto the Airflow nodes is problematic because Airflow runs in
Docker, and I don't want that file baked into the image or sitting in my
Git repo.
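
Something like this is what I have in mind for the SSH route (again just a
sketch, assuming the SSH provider's hook and operator - the host, user, and
paths are placeholders, and key_file is exactly the file I don't have a
good home for):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.ssh.hooks.ssh import SSHHook
    from airflow.providers.ssh.operators.ssh import SSHOperator

    # Sketch of the SSH approach: run spark-submit on the EMR master node
    # over SSH instead of from the Airflow worker.
    emr_master = SSHHook(
        remote_host="ec2-xx-xx-xx-xx.compute-1.amazonaws.com",  # placeholder
        username="hadoop",                     # default EMR login user
        key_file="/opt/airflow/keys/emr.pem",  # the problematic identity file
    )

    with DAG(
        dag_id="emr_spark_ssh_submit",
        start_date=datetime(2021, 1, 1),
        schedule=None,
        catchup=False,
    ) as dag:
        submit = SSHOperator(
            task_id="spark_submit_via_ssh",
            ssh_hook=emr_master,
            command=(
                "spark-submit --deploy-mode cluster "
                "s3://my-bucket/jobs/my_job.py"  # placeholder job script
            ),
        )

I could move the host and user into an Airflow SSH connection instead of
hardcoding them, but that still leaves the question of where the key file
itself lives.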

Is there anyone with a similar setup that would care to share their
solution?

--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001
