Since you're already on AWS, the simplest approach I can think of is to write a signal file once the job finishes and have the downstream job wait for that signal file. In other words, the same pattern Hadoop jobs use when they write a `_SUCCESS` file and downstream jobs depend on it.
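For illustration, here is a minimal sketch of that pattern in Airflow, assuming the signal file lands in S3. The DAG id, bucket, and key below are made-up placeholders, and the S3KeySensor import path varies with the Airflow version:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

    with DAG(
        dag_id="downstream_consumer",        # hypothetical DAG id
        start_date=datetime(2018, 5, 1),
        schedule_interval=None,              # externally triggered
        catchup=False,
    ) as dag:
        # Wait for the upstream job's signal file, analogous to Hadoop's _SUCCESS marker.
        wait_for_signal = S3KeySensor(
            task_id="wait_for_signal",
            bucket_name="my-data-bucket",         # placeholder bucket
            bucket_key="jobs/run-1234/_SUCCESS",  # placeholder key
            poke_interval=300,                    # poll every 5 minutes
            mode="reschedule",                    # frees the worker slot between pokes (Airflow >= 1.10.2)
            timeout=60 * 60 * 24 * 3,             # give up after 3 days
        )
        # The actual processing tasks would be chained after wait_for_signal.

On newer Airflow versions the reschedule mode also means no worker sits idle while the sensor waits, which helps with the worker-occupancy concern below.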
Regards,
Ananth.P

On 28 May 2018 at 13:06, Stefan Seelmann <m...@stefan-seelmann.de> wrote:
> Thanks Christopher for the idea. That would work, we already have such a
> "listener" that polls a queue (SQS) and creates the DAG runs. However it
> would have been nice to have the full process in one DAG to have a
> better overview of running jobs and leverage the gantt chart, but I
> think this can be accomplished via custom plugins and views.
>
> On 05/28/2018 08:43 PM, Christopher Bockman wrote:
> > Haven't done this, but we'll have a similar need in the future, so have
> > investigated a little.
> >
> > What about a design pattern something like this:
> >
> > 1) When jobs are done (ready for further processing) they publish those
> > details to a queue (such as GC Pub/Sub or any other sort of queue)
> >
> > 2) A single "listener" DAG sits and periodically checks that queue. If it
> > finds anything on it, it triggers (via DAG trigger) all of the DAGs which
> > are on the queue.*
> >
> > * = if your triggering volume is too high, this may cause airflow issues w/
> > too many going at once; this could presumably be solved then via custom
> > rate-limiting on firing these
> >
> > 3) The listener DAG resets itself (triggers itself)
> >
> >
> > On Mon, May 28, 2018 at 7:17 AM, Driesprong, Fokko <fo...@driesprong.frl>
> > wrote:
> >
> >> Hi Stefan,
> >>
> >> Afaik there isn't a more efficient way of doing this. DAGs that are relying
> >> on a lot of sensors are experiencing the same issues. The only way right
> >> now, I can think of, is updating the state directly in the database.
> >> But then you need to know what you are doing. I can imagine that this would
> >> be feasible by using an AWS lambda function. Hope this helps.
> >>
> >> Cheers, Fokko
> >>
> >> 2018-05-26 17:50 GMT+02:00 Stefan Seelmann <m...@stefan-seelmann.de>:
> >>
> >>> Hello,
> >>>
> >>> I have a DAG (externally triggered) where some processing is done at an
> >>> external system (EC2 instance). The processing is started by an Airflow
> >>> task (via HTTP request). The DAG should only continue once that
> >>> processing is completed. In a first naive implementation I created a
> >>> sensor that gets the progress (via HTTP request) and only if status is
> >>> "finished" returns true and the DAG run continues. That works but...
> >>>
> >>> ... the external processing can take hours or days, and during that time
> >>> a worker is occupied which does nothing but HTTP GET and sleep. There
> >>> will be hundreds of DAG runs in parallel which means hundreds of workers
> >>> are occupied.
> >>>
> >>> I looked into other operators that do computation on external systems
> >>> (ECSOperator, AWSBatchOperator) but they also follow that pattern and
> >>> just wait/sleep.
> >>>
> >>> So I want to ask if there is a more efficient way to build such a
> >>> workflow with Airflow?
> >>>
> >>> Kind Regards,
> >>> Stefan
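And for reference, a rough sketch of the single "listener" DAG pattern from the quoted thread above, assuming completed jobs publish messages to SQS. The queue URL, the message format, and the import paths are assumptions (the trigger API has moved between Airflow versions), and here the listener simply runs on a short schedule instead of re-triggering itself:

    import json
    from datetime import datetime

    import boto3
    from airflow import DAG
    from airflow.api.common.trigger_dag import trigger_dag
    from airflow.operators.python import PythonOperator

    # Hypothetical queue that completed jobs publish to.
    QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/job-events"

    def drain_queue_and_trigger(**_):
        """Poll SQS once and create a DAG run for every completed-job message."""
        sqs = boto3.client("sqs")
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=10
        )
        for msg in response.get("Messages", []):
            payload = json.loads(msg["Body"])
            # Assumes each message names the DAG to fire and carries its run configuration.
            trigger_dag(dag_id=payload["dag_id"], conf=payload.get("conf"))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    with DAG(
        dag_id="queue_listener",            # hypothetical DAG id
        start_date=datetime(2018, 5, 1),
        schedule_interval="*/5 * * * *",    # poll every 5 minutes
        catchup=False,
        max_active_runs=1,                  # avoid overlapping listener runs
    ) as dag:
        PythonOperator(
            task_id="drain_queue_and_trigger",
            python_callable=drain_queue_and_trigger,
        )

If triggering volume gets high, the rate-limiting Christopher mentioned could be added inside drain_queue_and_trigger by capping how many DAG runs are fired per poll.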