[ 
https://issues.apache.org/jira/browse/AIRFLOW-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157200#comment-16157200
 ] 

Taylor Edmiston commented on AIRFLOW-1131:
------------------------------------------

We're experiencing this same issue with successfully exiting Docker containers 
not getting picked up by Airflow as well running a fork of 1.8.1.

While we are using a custom Mesos executor (eventually hoping to upstream that 
as the contrib one is stale), the point in the code where the DockerOperator 
task instances are hanging occurs in the log streaming right before Airflow 
receives the exit code back.  What's weird about this for us is that in only 
happens ~20% of the time, and also that the DockerOperator's container has 
always already successfully printed its final output and shows exit code 0 in 
docker.  However, Airflow seems to lose track of the exited Docker container.

We see it getting stuck just before this line (the {{exit_code = ...}} line 
never runs):

https://github.com/apache/incubator-airflow/blob/1.8.1/airflow/operators/docker_operator.py#L180

Upon deep debugging we believe it's an issue with the Docker Python client 
(previously docker-py on PyPI, now just docker as of 2.x).  We had docker-py 
pinned at 1.8.1 but the issue seems to still occur on the latest subversion of 
1.x which is 1.10.6.  I'm going to investigate if upgrading to the latest 
version of docker 2.x which contains some breaking changes solves the issue for 
us.

> DockerOperator jobs time out and get stuck in "running" forever
> ---------------------------------------------------------------
>
>                 Key: AIRFLOW-1131
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1131
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.9.0
>         Environment: Python 2.7.12
> git+git://github.com/apache/incubator-airflow.git@35e43f5067f4741640278b765c0e54e4fd45ffa3#egg=airflow[async,password,celery,crypto,postgres,hive,hdfs,jdbc]
>            Reporter: Vitor Baptista
>
> With the following DAG and task:
> {code}
> import os
> from datetime import datetime, timedelta
> from airflow.models import DAG
> from airflow.operators.docker_operator import DockerOperator
> default_args = {
>     'owner': 'airflow',
>     'depends_on_past': False,
>     'start_date': datetime(2017, 1, 1),
>     'retries': 3,
>     'retry_delay': timedelta(minutes=10),
> }
> dag = DAG(
>     dag_id='smoke_test',
>     default_args=default_args,
>     max_active_runs=1,
>     schedule_interval='@daily'
> )
> sleep_forever_task = DockerOperator(
>     task_id='sleep_forever',
>     dag=dag,
>     image='alpine:latest',
>     api_version=os.environ.get('DOCKER_API_VERSION', '1.23'),
>     command='sleep {}'.format(60 * 60 * 24),
> )
> {code}
> When I run it, this is what I get:
> {code}
> *** Log file isn't local.
> *** Fetching here: 
> http://589ea17432ec:8793/log/smoke_test/sleep_forever/2017-04-18T00:00:00
> [2017-04-20 11:19:58,258] {models.py:172} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags/smoke_test.py
> [2017-04-20 11:19:58,438] {base_task_runner.py:112} INFO - Running: ['bash', 
> '-c', u'airflow run smoke_test sleep_forever 2017-04-18T00:00:00 --job_id 
> 2537 --raw -sd DAGS_FOLDER/smoke_test.py']
> [2017-04-20 11:19:58,888] {base_task_runner.py:95} INFO - Subtask: 
> /usr/local/airflow/src/airflow/airflow/configuration.py:128: 
> DeprecationWarning: This method will be removed in future versions.  Use 
> 'parser.read_file()' instead.
> [2017-04-20 11:19:58,888] {base_task_runner.py:95} INFO - Subtask:   
> self.readfp(StringIO.StringIO(string))
> [2017-04-20 11:19:59,214] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 11:19:59,214] {__init__.py:56} INFO - Using executor 
> CeleryExecutor
> [2017-04-20 11:19:59,227] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 11:19:59,227] {driver.py:120} INFO - Generating grammar tables 
> from /usr/lib/python2.7/lib2to3/Grammar.txt
> [2017-04-20 11:19:59,244] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 11:19:59,244] {driver.py:120} INFO - Generating grammar tables 
> from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
> [2017-04-20 11:19:59,377] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 11:19:59,377] {models.py:172} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags/smoke_test.py
> [2017-04-20 11:19:59,597] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 11:19:59,597] {models.py:1146} INFO - Dependencies all met for 
> <TaskInstance: smoke_test.sleep_forever 2017-04-18 00:00:00 [queued]>
> [2017-04-20 11:19:59,605] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 11:19:59,605] {models.py:1146} INFO - Dependencies all met for 
> <TaskInstance: smoke_test.sleep_forever 2017-04-18 00:00:00 [queued]>
> [2017-04-20 11:19:59,605] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 11:19:59,605] {models.py:1338} INFO - 
> [2017-04-20 11:19:59,606] {base_task_runner.py:95} INFO - Subtask: 
> --------------------------------------------------------------------------------
> [2017-04-20 11:19:59,606] {base_task_runner.py:95} INFO - Subtask: Starting 
> attempt 1 of 4
> [2017-04-20 11:19:59,606] {base_task_runner.py:95} INFO - Subtask: 
> --------------------------------------------------------------------------------
> [2017-04-20 11:19:59,606] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 11:19:59,620] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 11:19:59,620] {models.py:1362} INFO - Executing 
> <Task(DockerOperator): sleep_forever> on 2017-04-18 00:00:00
> [2017-04-20 11:19:59,662] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 11:19:59,661] {docker_operator.py:132} INFO - Starting docker 
> container from image alpine:latest
> [2017-04-20 12:21:25,661] {models.py:172} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags/smoke_test.py
> [2017-04-20 12:21:25,809] {base_task_runner.py:112} INFO - Running: ['bash', 
> '-c', u'airflow run smoke_test sleep_forever 2017-04-18T00:00:00 --job_id 
> 2574 --raw -sd DAGS_FOLDER/smoke_test.py']
> [2017-04-20 12:21:26,117] {base_task_runner.py:95} INFO - Subtask: 
> /usr/local/airflow/src/airflow/airflow/configuration.py:128: 
> DeprecationWarning: This method will be removed in future versions.  Use 
> 'parser.read_file()' instead.
> [2017-04-20 12:21:26,117] {base_task_runner.py:95} INFO - Subtask:   
> self.readfp(StringIO.StringIO(string))
> [2017-04-20 12:21:26,301] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 12:21:26,301] {__init__.py:56} INFO - Using executor 
> CeleryExecutor
> [2017-04-20 12:21:26,310] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 12:21:26,310] {driver.py:120} INFO - Generating grammar tables 
> from /usr/lib/python2.7/lib2to3/Grammar.txt
> [2017-04-20 12:21:26,324] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 12:21:26,324] {driver.py:120} INFO - Generating grammar tables 
> from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
> [2017-04-20 12:21:26,426] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 12:21:26,426] {models.py:172} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags/smoke_test.py
> [2017-04-20 12:21:26,564] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 12:21:26,563] {models.py:1140} INFO - Dependencies not met for 
> <TaskInstance: smoke_test.sleep_forever 2017-04-18 00:00:00 [running]>, 
> dependency 'Task Instance State' FAILED: Task is in the 'running' state which 
> is not a valid state for execution. The task must be cleared in order to be 
> run.
> [2017-04-20 12:21:26,564] {base_task_runner.py:95} INFO - Subtask: 
> [2017-04-20 12:21:26,564] {models.py:1140} INFO - Dependencies not met for 
> <TaskInstance: smoke_test.sleep_forever 2017-04-18 00:00:00 [running]>, 
> dependency 'Task Instance Not Already Running' FAILED: Task is already 
> running, it started on 2017-04-20 11:19:59.597425.
> [2017-04-20 12:21:30,821] {jobs.py:2163} WARNING - Recorded pid 113 is not a 
> descendant of the current pid 178
> {code}
> Note that it runs the task normally, and after about 1 hour it tries to 
> re-load the task, running it again, but then fails because the {{subprocess}} 
> started isn't a child of the current process. After this, the task is still 
> in {{running}} state, never changing to {{failed}}.
> I haven't tested, but I suspect this doesn't happen if the task keeps 
> printing something.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to