iantbutler01 opened a new issue #10122:
URL: https://github.com/apache/airflow/issues/10122


   
   **Apache Airflow version**:
   1.10.10
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
   v1.16.8-eks-e16311

   **Environment**:
   <details>
   KUBERNETES_SERVICE_PORT_HTTPS=443
   AIRFLOW__SMTP__SMTP_PORT=25
   AIRFLOW__KUBERNETES__NAMESPACE=airflow
   AIRFLOW__SMTP__SMTP_PASSWORD=*snip*
   AIRFLOW__SMTP__SMTP_USER=*snip*
   KUBERNETES_SERVICE_PORT=443
   BOILING_LAND_WEB_PORT_8080_TCP_PORT=8080
   REDIS_PASSWORD=*snip*
   
AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__REMOTE_BASE_LOGS_FOLDER=*snip*
   BOILING_LAND_WEB_SERVICE_PORT=8080
   HOSTNAME=boiling-land-scheduler-7bcb794c75-gjzjx
   PYTHON_VERSION=3.7.7
   LANGUAGE=C.UTF-8
   POSTGRES_PASSWORD=*snip*
   PIP_VERSION=19.0.2
   AIRFLOW__KUBERNETES__DELETE_WORKER_PODS_ON_FAILURE=False
   AIRFLOW__WEBSERVER__BASE_URL=http://localhost:8080
   AIRFLOW__SCHEDULER__CHILD_PROCESS_LOG_DIRECTORY=/opt/airflow/logs/scheduler
   AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
   
AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_connection
   BOILING_LAND_WEB_SERVICE_PORT_WEB=8080
   AIRFLOW__CORE__DONOT_PICKLE=false
   BOILING_LAND_WEB_PORT=tcp://172.20.191.242:8080
   PWD=/opt/airflow
   AIRFLOW_VERSION=1.10.10
   AIRFLOW__SMTP__SMTP_MAIL_FROM=*snip*
   AWS_ROLE_ARN=*snip*
   AIRFLOW__CORE__LOAD_EXAMPLES=False
   TZ=Etc/UTC
   [email protected]:whize/airflow-dags.git
   AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT=/opt/airflow/dags
   HOME=/home/airflow
   AIRFLOW__KUBERNETES__ENV_FROM_CONFIGMAP_REF=boiling-land-env
   LANG=C.UTF-8
   KUBERNETES_PORT_443_TCP=tcp://172.20.0.1:443
   AIRFLOW_HOME=/opt/airflow
   DATABASE_USER=postgres
   AIRFLOW__KUBERNETES__GIT_SSH_KEY_SECRET_NAME=airflow-kube-pods-git
   DATABASE_PORT=5432
   GPG_KEY=*snip*
   
AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__REMOTE_LOGGING=True
   AIRFLOW__CORE__EXECUTOR=KubernetesExecutor
   AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://*snip*
   AIRFLOW__KUBERNETES__RUN_AS_USER=50000
   AIRFLOW__CORE__BASE_LOG_FOLDER=/opt/airflow/logs
   
AIRFLOW__CORE__DAG_PROCESSOR_MANAGER_LOG_LOCATION=/opt/airflow/logs/dag_processor_manager/dag_processor_manager.log
   AIRFLOW__CORE__ENABLE_XCOM_PICKLING=false
   TERM=xterm
   AIRFLOW__SCHEDULER__MAX_THREADS=8
   AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE=5
   AIRFLOW_CONN_S3_CONNECTION=aws://
   
   AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY=apache/airflow
   DATABASE_DB=airflow
   AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG=1.10.10-python3.7
   BOILING_LAND_WEB_PORT_8080_TCP_PROTO=tcp
   AIRFLOW__KUBERNETES__IN_CLUSTER=True
   DATABASE_PASSWORD=*snip*
   AIRFLOW_GID=50000
   SHLVL=1
   
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
   KUBERNETES_PORT_443_TCP_PROTO=tcp
   BOILING_LAND_WEB_SERVICE_HOST=172.20.191.242
   LC_MESSAGES=C.UTF-8
   PYTHON_PIP_VERSION=20.0.2
   KUBERNETES_PORT_443_TCP_ADDR=172.20.0.1
   DATABASE_HOST=*snip*
   AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_connection
   AIRFLOW__EMAIL__EMAIL_BACKEND=airflow.utils.email.send_email_smtp
   LC_CTYPE=C.UTF-8
   AIRFLOW__SMTP__SMTP_STARTTLS=False
   AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME=boiling-land
   PYTHON_GET_PIP_SHA256=*snip*
   AIRFLOW__CORE__SQL_ALCHEMY_CONN=*snip*
   KUBERNETES_SERVICE_HOST=172.20.0.1
   LC_ALL=C.UTF-8
   AIRFLOW__CORE__REMOTE_LOGGING=True
   KUBERNETES_PORT=tcp://172.20.0.1:443
   KUBERNETES_PORT_443_TCP_PORT=443
   AIRFLOW_KUBERNETES_ENVIRONMENT_VARIABLES_KUBE_CLIENT_REQUEST_TIMEOUT_SEC=50
   AIRFLOW__KUBERNETES__GIT_BRANCH=master
   
PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/d59197a3c169cef378a22428a3fa99d33e080a5d/get-pip.py
   AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=False
   
PATH=/home/airflow/.local/bin:/home/airflow/.local/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
   AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH=repo/
   PYTHON_BASE_IMAGE=python:3.7-slim-buster
   AIRFLOW_UID=50000
   AIRFLOW__CORE__FERNET_KEY=*snip*
   DEBIAN_FRONTEND=noninteractive
   BOILING_LAND_WEB_PORT_8080_TCP=tcp://172.20.191.242:8080
   AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__FERNET_KEY=*snip*
   AIRFLOW__SMTP__SMTP_SSL=False
   BOILING_LAND_WEB_PORT_8080_TCP_ADDR=172.20.191.242
   AIRFLOW__SMTP__SMTP_HOST=email-smtp.us-east-1.amazonaws.com
   _=/usr/bin/env
   </details>
   
   - **Cloud provider or hardware configuration**: AWS EKS
   
   - **OS** (e.g. from /etc/os-release):
   NAME="Amazon Linux"
   VERSION="2"
   
   - **Kernel** (e.g. `uname -a`): Linux<AWS_INTERNAL_HOSTNAME>.x86_64 #1 SMP 
Thu May 7 18:48:23 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
   
   
   - **Install tools**:
   - **Others**:
   
   **What happened**:
   With the KubernetesExecutor, a worker pod is launched to prepare and run a task that uses the KubernetesPodOperator. The task fails because of an issue in the task definition, such as an invalid option. The worker pod does not exit immediately, however: it takes about 40 minutes for it to exit after the failure and report its state in the UI. Interestingly, the execution time is correctly listed as < 1 second.
   
   The task logs on the launcher pod also say the task is being marked as failed, and after about 40 minutes the state does change:
   
   <details>
   
   ```
   [2020-08-03 04:28:32,844] {taskinstance.py:1202} INFO - Marking task as FAILED.dag_id=arxiv_crawler_pipeline, task_id=launch_crawl_pod, execution_date=20200803T042640, start_date=20200803T042652, end_date=20200803T042832
   ```
   </details>
   
   The scheduler logs on the launcher pod say nothing about the failure though:
   <details>
   
   ```
   [2020-08-03 04:26:51,543] {__init__.py:51} INFO - Using executor 
LocalExecutor
   [2020-08-03 04:26:51,544] {dagbag.py:396} INFO - Filling up the DagBag from 
/opt/airflow/dags/crawlers/arxiv/arxiv_crawl_pipeline.py
   
/home/airflow/.local/lib/python3.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py:159:
 PendingDeprecationWarning: Invalid arguments were passed to 
KubernetesPodOperator (task_id: launch_crawl_pod). Support for passing such 
arguments will be dropped in Airflow 2.0. Invalid arguments were:
   *args: ()
   **kwargs: {'reattach_on_restart': True, 'log_events_on_failure': True}
     super(KubernetesPodOperator, self).__init__(*args, resources=None, 
**kwargs)
   
/home/airflow/.local/lib/python3.7/site-packages/airflow/sensors/base_sensor_operator.py:71:
 PendingDeprecationWarning: Invalid arguments were passed to HttpSensor 
(task_id: wait_for_finish). Support for passing such arguments will be dropped 
in Airflow 2.0. Invalid arguments were:
   *args: ()
   **kwargs: {'result_check': <function check_http_response at 0x7fdc378bf320>}
     super(BaseSensorOperator, self).__init__(*args, **kwargs)
   Running %s on host %s <TaskInstance: arxiv_crawler_pipeline.launch_crawl_pod 
2020-08-03T04:26:40.850022+00:00 [queued]> 
arxivcrawlerpipelinelaunchcrawlpod-4c7e99ae14704b2b8fa0d64db508
   ```
   
   </details>
   
   **What you expected to happen**:
   The pod should exit immediately and report the failed task state to the metadata database, so that the failure is reflected in the UI in a much more timely fashion.
   <!-- What do you think went wrong? -->
   No idea. I've been looking at this for about 12 hours now, and this report is my "I can't figure it out" moment.
   **How to reproduce it**:
   
   Set up an Airflow cluster on Kubernetes with the KubernetesExecutor, then create a DAG with a KubernetesPodOperator task that fails, either while the operator attempts to launch the pod or inside the pod the task creates.
   
   
   How often does this problem occur? Once? Every time, etc.?
   Every single time.

   Any relevant logs to include? Put them here inside a details tag:

   The logs don't really give any insight into why there is such a dramatic lag between the failure and the metadata update.
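While the state is stuck, the lingering worker pod can at least be inspected directly. A rough sketch, assuming `kubectl` access to the cluster; `<worker-pod-name>` is a placeholder for the stuck launcher pod (e.g. the `arxivcrawlerpipelinelaunchcrawlpod-...` pod from the logs above), and the `airflow` namespace comes from the env dump:

```shell
# List worker pods in the namespace the executor launches into.
kubectl get pods -n airflow

# Inspect the stuck launcher pod: status, container state, and recent events.
kubectl describe pod <worker-pod-name> -n airflow

# Dump its output (the task/scheduler logs quoted above come from here).
kubectl logs <worker-pod-name> -n airflow
```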
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

