Hello Airflowers,

Does anyone see a better way to do this? It would really help my Airflow setup.
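For reference, here is a stripped-down sketch of the DAG described in the quoted message below (Airflow 1.x style imports; the dag_id, task ids, operator choices, paths and table names are simplified placeholders rather than my real code):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "nadeem",
    "start_date": datetime(2016, 7, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="emr_to_hive",                  # placeholder dag_id
    default_args=default_args,
    schedule_interval=timedelta(hours=3),  # 8 runs per day
)

# 1) map-reduce job on the EMR cluster, writing its output to S3
run_mapreduce = BashOperator(
    task_id="run_mapreduce",
    bash_command="hadoop jar /path/to/job.jar s3://my-bucket/output/{{ ds }}/",
    dag=dag,
)

# 2) Hive load that reads the files produced by step 1 from S3
load_hive = BashOperator(
    task_id="load_hive",
    bash_command=(
        "hive -e \"ALTER TABLE events ADD IF NOT EXISTS "
        "PARTITION (dt='{{ ds }}') "
        "LOCATION 's3://my-bucket/output/{{ ds }}/'\""
    ),
    dag=dag,
)

load_hive.set_upstream(run_mapreduce)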
Thanks,
Nadeem

On Thu, Aug 11, 2016 at 3:09 PM, Nadeem Ahmed Nazeer <[email protected]> wrote:
> Hello,
>
> My Airflow DAG consists of 2 tasks:
> 1) map-reduce jobs (write output to S3)
> 2) Hive loads (using the files from 1)
>
> My EMR Hadoop cluster runs on AWS spot instances, so when spot instance
> pricing goes up, the cluster dies and a new one comes up.
>
> When a cluster dies, I clear all the Hive load tasks in Airflow. That way
> the tables get rebuilt in the new cluster from the files in S3.
>
> But over time, as the backfill grows, this approach becomes inefficient.
> My DAG runs every 3 hours (8 runs a day). So, for example, if the cluster
> goes down after a month, Airflow now has to backfill 240 (8 * 30) cleared
> tasks, and the backfill only gets bigger with time.
>
> What would be a better way to handle this? Currently I'm planning to
> re-base Airflow manually once a month: bring everything down and restart
> Airflow with a new start date of the current day. That would keep the
> backfill under a month's worth of runs, but there has to be a better way
> of doing this.
>
> Please provide any suggestions.
>
> Thanks,
> Nadeem
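P.S. The clearing step I mentioned is roughly the following (Airflow 1.x CLI; the dag/task ids match the placeholder sketch above, and the dates are just an example window):

# Clear only the Hive load tasks (-t task regex) for a bounded window
# (-s/-e start and end dates); -c skips the confirmation prompt.
airflow clear emr_to_hive -t load_hive -s 2016-07-01 -e 2016-08-11 -c

Every run in that window then goes back into the scheduler's backfill queue, which is exactly the part that keeps growing over time.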
