Hello Airflowers,

Does anyone see a better way to do this? It would really help my Airflow setup.
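For reference, here is a stripped-down sketch of the DAG described in the quoted message below (Airflow 1.x style imports; the dag_id, task ids, operator choices, paths and table names are simplified placeholders rather than my real code):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "nadeem",
    "start_date": datetime(2016, 7, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="emr_to_hive",                  # placeholder dag_id
    default_args=default_args,
    schedule_interval=timedelta(hours=3),  # 8 runs per day
)

# 1) map-reduce job on the EMR cluster, writing its output to S3
run_mapreduce = BashOperator(
    task_id="run_mapreduce",
    bash_command="hadoop jar /path/to/job.jar s3://my-bucket/output/{{ ds }}/",
    dag=dag,
)

# 2) Hive load that reads the files produced by step 1 from S3
load_hive = BashOperator(
    task_id="load_hive",
    bash_command=(
        "hive -e \"ALTER TABLE events ADD IF NOT EXISTS "
        "PARTITION (dt='{{ ds }}') "
        "LOCATION 's3://my-bucket/output/{{ ds }}/'\""
    ),
    dag=dag,
)

load_hive.set_upstream(run_mapreduce)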
Thanks,
Nadeem

On Thu, Aug 11, 2016 at 3:09 PM, Nadeem Ahmed Nazeer <[email protected]> wrote:
> Hello,
>
> My Airflow DAG consists of 2 tasks:
> 1) map-reduce jobs (write output to S3)
> 2) Hive loads (using the files from 1)
>
> My EMR Hadoop cluster runs on AWS spot instances, so when spot instance
> pricing goes up, the cluster dies and a new one comes up.
>
> When a cluster dies, I clear all the Hive load tasks in Airflow. That way
> the tables get rebuilt in the new cluster from the files in S3.
>
> But over time, as the backfill grows, this approach becomes inefficient.
> My DAG runs every 3 hours (8 runs a day). So, for example, if the cluster
> goes down after a month, Airflow now has to backfill 240 (8 * 30) cleared
> tasks, and the backfill only gets bigger with time.
>
> What would be a better way to handle this? Currently I'm planning to
> re-base Airflow manually once a month: bring everything down and restart
> Airflow with a new start date of the current day. That would keep the
> backfill under a month's worth of runs, but there has to be a better way
> of doing this.
>
> Please provide any suggestions.
>
> Thanks,
> Nadeem
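P.S. The clearing step I mentioned is roughly the following (Airflow 1.x CLI; the dag/task ids match the placeholder sketch above, and the dates are just an example window):

# Clear only the Hive load tasks (-t task regex) for a bounded window
# (-s/-e start and end dates); -c skips the confirmation prompt.
airflow clear emr_to_hive -t load_hive -s 2016-07-01 -e 2016-08-11 -c

Every run in that window then goes back into the scheduler's backfill queue, which is exactly the part that keeps growing over time.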
