What version of Airflow? Off the top of my head, for 1.8.x:

1) we do use db locking
2) we check the state after we get the lock
3) I don't think the task sets a state if it finds out it is running somewhere 
else
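
A minimal sketch of the pattern 1) and 2) describe, in SQLAlchemy terms. This
is illustrative only: the model below is a simplified stand-in, not Airflow's
actual task_instance schema or code.

    # Sketch of "lock the row, then re-check state" -- simplified stand-in,
    # not Airflow's actual code.
    from sqlalchemy import Column, String, create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class TaskInstance(Base):
        __tablename__ = 'task_instance'
        dag_id = Column(String, primary_key=True)
        task_id = Column(String, primary_key=True)
        state = Column(String)

    engine = create_engine('postgresql://airflow@localhost/airflow')  # placeholder URI
    Session = sessionmaker(bind=engine)

    def try_to_start(dag_id, task_id):
        session = Session()
        # 1) db locking: SELECT ... FOR UPDATE holds the row until commit
        ti = (session.query(TaskInstance)
              .filter_by(dag_id=dag_id, task_id=task_id)
              .with_for_update()
              .one())
        # 2) check the state only after the lock is held
        if ti.state == 'running':
            session.commit()   # release the lock without touching the row
            return False       # another worker already has it
        ti.state = 'running'
        session.commit()
        return True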

Maybe we do something at the executor/scheduler level. I need to investigate 
whether this issue is still present on a recent version. 

Bolke

Sent from my iPhone

> On 4 Aug 2017, at 19:24, George Leslie-Waksman <[email protected]> wrote:
> 
> Pretty sure (not 100%) what is happening is:
> 
>   1. Scheduler bugs result in task getting scheduled twice
>   2. Worker 1 grabs task
>   3. Worker 2 grabs task
>   4. Worker 1 starts task
>   5. Worker 2 starts task
>   6. Worker 2 sees that Worker 1 has started and plans to abort
>   7. Worker 1 finishes and marks task as done
>   8. Worker 2 finishes aborting and marks task as not done
> 
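If that is the sequence, step 8 is the overwrite that a guarded, compare-and-set
style update would refuse: only change the state if this worker still owns the
row. A rough illustration with a hypothetical table and columns (not Airflow's
real schema):

    # Compare-and-set sketch: worker 2 may only change the state while it
    # still "owns" the row. Table and columns are hypothetical.
    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute("CREATE TABLE task_instance (task_id TEXT, state TEXT, locked_by TEXT)")
    conn.execute("INSERT INTO task_instance VALUES ('t1', 'success', 'worker-1')")

    def mark_state(task_id, new_state, worker):
        cur = conn.execute(
            "UPDATE task_instance SET state = ? "
            "WHERE task_id = ? AND locked_by = ?",   # guard: only if we own it
            (new_state, task_id, worker))
        conn.commit()
        return cur.rowcount == 1   # 0 rows touched -> someone else owns it

    # Step 8 becomes a no-op instead of clobbering worker 1's result:
    print(mark_state('t1', 'up_for_retry', 'worker-2'))   # False
    print(mark_state('t1', 'success', 'worker-1'))        # True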
> 
>> On Fri, Jul 28, 2017 at 3:50 PM Marc Weil <[email protected]> wrote:
>> 
>> Hey Max,
>> 
>> Thanks for the suggestions. I believe it was a retry (I'm using remote
>> logging so I can only check after the task completes), but the UI never
>> reported it as such. The latest_heartbeat column is in the jobs table, and
>> interestingly I do see some running jobs that haven't heartbeated for ~22
>> minutes. They are LocalTaskJob instances with CeleryExecutor properly
>> listed as the executor class. I can't really correlate these to a specific
>> task instance, however, as there doesn't appear to be any key written to
>> the jobs table (the dag_id column is all null, and there's no task_id
>> column or anything).
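For reference, one way to pull those stale rows out of the metadata DB. The
connection URI, the table name (it may be 'job' rather than 'jobs' depending on
the version), and the five-minute threshold are assumptions to adjust:

    # List jobs that claim to be running but have a stale heartbeat.
    # URI, table name, and threshold are placeholders for your setup.
    from datetime import datetime, timedelta
    from sqlalchemy import create_engine, text

    engine = create_engine('postgresql://airflow@localhost/airflow')
    cutoff = datetime.utcnow() - timedelta(minutes=5)

    stale_jobs = text("""
        SELECT id, job_type, state, latest_heartbeat, hostname
        FROM job
        WHERE state = 'running'
          AND latest_heartbeat < :cutoff
        ORDER BY latest_heartbeat
    """)

    with engine.connect() as conn:
        for row in conn.execute(stale_jobs, {'cutoff': cutoff}):
            print(row)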
>> 
>> Any ideas on what could be making these tasks stop heartbeating regularly?
>> That could explain why everything eventually (after an overnight period)
>> gets marked as finished in the Airflow UI: these tasks do heartbeat again
>> eventually, but quite a while after they have finished running.
>> 
>> Thanks again!
>> 
>> --
>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
>> New Relic
>> 
>> On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <[email protected]> wrote:
>> 
>>> Are you sure there hasn't been a retry at that point? One of the expected
>>> behaviors is the one I described: if a task finishes without reporting
>>> its success (or failure), it will stay marked as RUNNING but will fail to
>>> emit a heartbeat (which is a timestamp updated in the task_instance table
>>> as last_heartbeat or something). The scheduler monitors for RUNNING tasks
>>> without a heartbeat and eventually handles the failure (sends emails,
>>> calls on_failure_callback, ...).
>>> 
>>> Looking for heartbeat in the DB might give you some clues as to what is
>>> going on.
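Roughly, the shape of that check looks like the sketch below. This is not the
scheduler's real code; the five-minute threshold and the attributes on the task
instance objects are made up for illustration:

    # "RUNNING but no recent heartbeat" => treat as failed (illustrative).
    from datetime import datetime, timedelta

    HEARTBEAT_TIMEOUT = timedelta(minutes=5)   # made-up threshold

    def find_zombies(task_instances, now=None):
        now = now or datetime.utcnow()
        return [ti for ti in task_instances
                if ti.state == 'running'
                and now - ti.latest_heartbeat > HEARTBEAT_TIMEOUT]

    def handle_failure(ti):
        # this is where the emails and on_failure_callback handling would go
        print("no heartbeat since %s, treating %s.%s as failed"
              % (ti.latest_heartbeat, ti.dag_id, ti.task_id))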
>>> 
>>> Also, there have been versions where we'd occasionally see double
>>> triggering and double firing, which can be confusing: you then have
>>> different processes reporting their status, and debugging those issues can
>>> be problematic. I think there's good prevention against that now, using
>>> database transactions as the task instance sets itself as RUNNING. I'm
>>> not sure if 1.8.0 is 100% clean in that regard.
>>> 
>>> Max
>>> 
>>>> On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil <[email protected]> wrote:
>>>> 
>>>> It happens mostly when the scheduler is catching up. More specifically,
>>>> when I load a brand new DAG with a start date in the past. Usually I
>>>> have it set to run 5 DAG runs at the same time, and up to 16 tasks at
>>>> the same time.
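For anyone reading along, that setup corresponds roughly to the DAG-level
max_active_runs and concurrency arguments; the dag_id, dates, and task below
are placeholders:

    # Placeholder DAG showing the two knobs mentioned above:
    # 5 simultaneous DAG runs and up to 16 simultaneous task instances.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id='catchup_heavy_dag',        # placeholder name
        start_date=datetime(2017, 1, 1),   # start date in the past -> catch-up
        schedule_interval='@daily',
        max_active_runs=5,                 # 5 DAG runs at the same time
        concurrency=16,                    # up to 16 tasks at the same time
    )

    work = BashOperator(task_id='do_work', bash_command='echo work', dag=dag)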
>>>> 
>>>> What I've also noticed is that the tasks will sit completed in reality
>>>> but uncompleted in the Airflow DB for many hours; if I just leave them
>>>> all sitting there overnight, though, they all tend to be marked complete
>>>> the next morning. Perhaps this points to some sort of Celery timeout or
>>>> connection retry interval?
>>>> 
>>>> --
>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
>>>> New Relic
>>>> 
>>>> On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <[email protected]> wrote:
>>>> 
>>>>> By the time "INFO - Task exited with return code 0" gets logged, the
>>>>> task should have been marked as successful by the subprocess. I have no
>>>>> specific intuition as to what the issue may be.
>>>>> 
>>>>> I'm guessing that at that point the job stops emitting heartbeats and
>>>>> eventually the scheduler will handle it as a failure?
>>>>> 
>>>>> How often does that happen?
>>>>> 
>>>>> Max
>>>>> 
>>>>> On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <[email protected]> wrote:
>>>>> 
>>>>>> From what I can tell, it only affects CeleryExecutor. I've never seen
>>>>>> this behavior with LocalExecutor before.
>>>>>> 
>>>>>> Max, do you know anything about this type of failure mode?
>>>>>> ᐧ
>>>>>> 
>>>>>> --
>>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
>>>>>> New Relic
>>>>>> 
>>>>>> On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <[email protected]> wrote:
>>>>>> 
>>>>>>> We have the exact same problem. In our case, it's a bash operator
>>>>>>> starting a docker container. The container and the process it ran have
>>>>>>> exited, but the 'docker run' command is still showing up in the process
>>>>>>> table, waiting for an event.
>>>>>>> I'm trying to switch to LocalExecutor to see if that will help.
>>>>>>> 
>>>>>>> _jonas
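For what it's worth, a stripped-down stand-in for the kind of task Jonas
describes (image name, dag_id, and dates are placeholders). The point is that
the operator only reports success once the 'docker run' process itself
returns, not when the work inside the container finishes:

    # Stand-in for a BashOperator wrapping `docker run`. All names are
    # placeholders; the task finishes only when `docker run` exits.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(dag_id='docker_via_bash',
              start_date=datetime(2017, 7, 1),
              schedule_interval='@daily')

    run_container = BashOperator(
        task_id='run_container',
        bash_command='docker run --rm my-image:latest /app/job.sh',
        dag=dag,
    )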
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Hello,
>>>>>>>> 
>>>>>>>> Has anyone seen the behavior, when using CeleryExecutor, where workers
>>>>>>>> will finish their tasks ("INFO - Task exited with return code 0" shows
>>>>>>>> in the logs) but the tasks are never marked as complete in the Airflow
>>>>>>>> DB or UI? Effectively this causes tasks to hang even though they are
>>>>>>>> complete, and the DAG will not continue.
>>>>>>>> 
>>>>>>>> This is happening on 1.8.0. Anyone else seen this, or perhaps have a
>>>>>>>> workaround?
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
>>>>>>>> New Relic
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
