iantbutler01 opened a new issue #10122: URL: https://github.com/apache/airflow/issues/10122
**Apache Airflow version**: 1.10.10

**Kubernetes version (if you are using kubernetes)** (use `kubectl version`): v1.16.8-eks-e16311

**Environment**:

<details><summary>Environment variables</summary>

    KUBERNETES_SERVICE_PORT_HTTPS=443
    AIRFLOW__SMTP__SMTP_PORT=25
    AIRFLOW__KUBERNETES__NAMESPACE=airflow
    AIRFLOW__SMTP__SMTP_PASSWORD=*snip*
    AIRFLOW__SMTP__SMTP_USER=*snip*
    KUBERNETES_SERVICE_PORT=443
    BOILING_LAND_WEB_PORT_8080_TCP_PORT=8080
    REDIS_PASSWORD=fjODRhL3FL6n0y4cA
    AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__REMOTE_BASE_LOGS_FOLDER=*snip*
    BOILING_LAND_WEB_SERVICE_PORT=8080
    HOSTNAME=boiling-land-scheduler-7bcb794c75-gjzjx
    PYTHON_VERSION=3.7.7
    LANGUAGE=C.UTF-8
    POSTGRES_PASSWORD=*snip*
    PIP_VERSION=19.0.2
    AIRFLOW__KUBERNETES__DELETE_WORKER_PODS_ON_FAILURE=False
    AIRFLOW__WEBSERVER__BASE_URL=http://localhost:8080
    AIRFLOW__SCHEDULER__CHILD_PROCESS_LOG_DIRECTORY=/opt/airflow/logs/scheduler
    AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
    AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_connection
    BOILING_LAND_WEB_SERVICE_PORT_WEB=8080
    AIRFLOW__CORE__DONOT_PICKLE=false
    BOILING_LAND_WEB_PORT=tcp://172.20.191.242:8080
    PWD=/opt/airflow
    AIRFLOW_VERSION=1.10.10
    AIRFLOW__SMTP__SMTP_MAIL_FROM=*snip*
    AWS_ROLE_ARN=*snip*
    AIRFLOW__CORE__LOAD_EXAMPLES=False
    TZ=Etc/UTC
    [email protected]:whize/airflow-dags.git
    AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT=/opt/airflow/dags
    HOME=/home/airflow
    AIRFLOW__KUBERNETES__ENV_FROM_CONFIGMAP_REF=boiling-land-env
    LANG=C.UTF-8
    KUBERNETES_PORT_443_TCP=tcp://172.20.0.1:443
    AIRFLOW_HOME=/opt/airflow
    DATABASE_USER=postgres
    AIRFLOW__KUBERNETES__GIT_SSH_KEY_SECRET_NAME=airflow-kube-pods-git
    DATABASE_PORT=5432
    GPG_KEY=*snip*
    AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__REMOTE_LOGGING=True
    AIRFLOW__CORE__EXECUTOR=KubernetesExecutor
    AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://*snip*
    AIRFLOW__KUBERNETES__RUN_AS_USER=50000
    AIRFLOW__CORE__BASE_LOG_FOLDER=/opt/airflow/logs
    AIRFLOW__CORE__DAG_PROCESSOR_MANAGER_LOG_LOCATION=/opt/airflow/logs/dag_processor_manager/dag_processor_manager.log
    AIRFLOW__CORE__ENABLE_XCOM_PICKLING=false
    TERM=xterm
    AIRFLOW__SCHEDULER__MAX_THREADS=8
    AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE=5
    AIRFLOW_CONN_S3_CONNECTION=aws://
    AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY=apache/airflow
    DATABASE_DB=airflow
    AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG=1.10.10-python3.7
    BOILING_LAND_WEB_PORT_8080_TCP_PROTO=tcp
    AIRFLOW__KUBERNETES__IN_CLUSTER=True
    DATABASE_PASSWORD=*snip*
    AIRFLOW_GID=50000
    SHLVL=1
    AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
    KUBERNETES_PORT_443_TCP_PROTO=tcp
    BOILING_LAND_WEB_SERVICE_HOST=172.20.191.242
    LC_MESSAGES=C.UTF-8
    PYTHON_PIP_VERSION=20.0.2
    KUBERNETES_PORT_443_TCP_ADDR=172.20.0.1
    DATABASE_HOST=*snip*
    AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_connection
    AIRFLOW__EMAIL__EMAIL_BACKEND=airflow.utils.email.send_email_smtp
    LC_CTYPE=C.UTF-8
    AIRFLOW__SMTP__SMTP_STARTTLS=False
    AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME=boiling-land
    PYTHON_GET_PIP_SHA256=*snip*
    AIRFLOW__CORE__SQL_ALCHEMY_CONN=*snip*
    KUBERNETES_SERVICE_HOST=172.20.0.1
    LC_ALL=C.UTF-8
    AIRFLOW__CORE__REMOTE_LOGGING=True
    KUBERNETES_PORT=tcp://172.20.0.1:443
    KUBERNETES_PORT_443_TCP_PORT=443
    AIRFLOW_KUBERNETES_ENVIRONMENT_VARIABLES_KUBE_CLIENT_REQUEST_TIMEOUT_SEC=50
    AIRFLOW__KUBERNETES__GIT_BRANCH=master
    PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/d59197a3c169cef378a22428a3fa99d33e080a5d/get-pip.py
    AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=False
    PATH=/home/airflow/.local/bin:/home/airflow/.local/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH=repo/
    PYTHON_BASE_IMAGE=python:3.7-slim-buster
    AIRFLOW_UID=50000
    AIRFLOW__CORE__FERNET_KEY=*snip*
    DEBIAN_FRONTEND=noninteractive
    BOILING_LAND_WEB_PORT_8080_TCP=tcp://172.20.191.242:8080
    AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABALES__AIRFLOW__CORE__FERNET_KEY=*snip*
    AIRFLOW__SMTP__SMTP_SSL=False
    BOILING_LAND_WEB_PORT_8080_TCP_ADDR=172.20.191.242
    AIRFLOW__SMTP__SMTP_HOST=email-smtp.us-east-1.amazonaws.com
    _=/usr/bin/env

</details>

- **Cloud provider or hardware configuration**: AWS EKS
- **OS** (e.g. from /etc/os-release): NAME="Amazon Linux" VERSION="2"
- **Kernel** (e.g. `uname -a`): Linux<AWS_INTERNAL_HOSTNAME>.x86_64 #1 SMP Thu May 7 18:48:23 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- **Install tools**:
- **Others**:

**What happened**:

With the KubernetesExecutor, a worker pod is launched to run a task that uses the KubernetesPodOperator. The task fails due to an issue in the task definition, such as an invalid option. However, the pod does not exit immediately: it takes about 40 minutes after the failure for it to exit and report its state in the UI. Interestingly, the execution time is correctly listed as < 1 second.
The task logs on the launcher pod also show the task being marked as failed, although the state in the UI only changes after about 40 minutes:

<details>

`[2020-08-03 04:28:32,844] {taskinstance.py:1202} INFO - Marking task as FAILED.dag_id=arxiv_crawler_pipeline, task_id=launch_crawl_pod, execution_date=20200803T042640, start_date=20200803T042652, end_date=20200803T042832 `

</details>

The scheduler logs on the launcher pod say nothing about the failure, though:

<details>

```
[2020-08-03 04:26:51,543] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-08-03 04:26:51,544] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags/crawlers/arxiv/arxiv_crawl_pipeline.py
/home/airflow/.local/lib/python3.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py:159: PendingDeprecationWarning: Invalid arguments were passed to KubernetesPodOperator (task_id: launch_crawl_pod). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'reattach_on_restart': True, 'log_events_on_failure': True}
  super(KubernetesPodOperator, self).__init__(*args, resources=None, **kwargs)
/home/airflow/.local/lib/python3.7/site-packages/airflow/sensors/base_sensor_operator.py:71: PendingDeprecationWarning: Invalid arguments were passed to HttpSensor (task_id: wait_for_finish). Support for passing such arguments will be dropped in Airflow 2.0.
Invalid arguments were:
*args: ()
**kwargs: {'result_check': <function check_http_response at 0x7fdc378bf320>}
  super(BaseSensorOperator, self).__init__(*args, **kwargs)
Running %s on host %s <TaskInstance: arxiv_crawler_pipeline.launch_crawl_pod 2020-08-03T04:26:40.850022+00:00 [queued]> arxivcrawlerpipelinelaunchcrawlpod-4c7e99ae14704b2b8fa0d64db508
```

</details>

**What you expected to happen**:

The pod should exit immediately and report the failed task state in the metadata database, which should then be reflected in the job UI in a much more timely fashion.

**What do you think went wrong?**

No idea. I've been looking at this for about 12 hours now, and this report is my "I can't figure it out" moment.

**How to reproduce it**:

Set up an Airflow cluster on a Kubernetes cluster with the KubernetesExecutor and create a job with a KubernetesPodOperator task that fails, either during the attempt to launch or inside the pod created by the task itself.

**How often does this problem occur? Once? Every time, etc.?**

Every single time.

**Any relevant logs to include? Put them here inside a details tag:**

The logs don't really give any insight into why there is such a dramatic lag between the failure and the metadata being updated.
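As an aside on the `PendingDeprecationWarning` in the scheduler log above: Airflow 1.10 accepts unknown operator keyword arguments (such as `reattach_on_restart` and `log_events_on_failure`, which the installed 1.10.10 `KubernetesPodOperator` does not recognise) with a warning rather than an error, so a misconfigured task is not rejected at parse time. A minimal, hypothetical sketch of that lenient behavior (the class and message below are illustrative, not Airflow's actual code):

```python
import warnings

class OperatorSketch:
    """Hypothetical stand-in for an Airflow 1.10-style operator:
    unknown kwargs only emit a warning, they do not raise."""
    def __init__(self, task_id, **kwargs):
        self.task_id = task_id
        if kwargs:
            # Mirrors the wording of the warning seen in the scheduler log.
            warnings.warn(
                "Invalid arguments were passed to {} (task_id: {}). "
                "Invalid arguments were: **kwargs: {}".format(
                    type(self).__name__, task_id, kwargs
                ),
                PendingDeprecationWarning,
            )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The same unrecognised kwargs reported in the issue's log.
    OperatorSketch(
        "launch_crawl_pod",
        reattach_on_restart=True,
        log_events_on_failure=True,
    )
```

Because the bad options only warn, the task is still queued and the failure surfaces later at run time, which is where the 40-minute reporting lag described above becomes visible.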
