gabriel-attie commented on issue #41340:
URL: https://github.com/apache/airflow/issues/41340#issuecomment-2283867696
@josix:
1 - The DAGs don't really matter, since they disappear randomly. But here
is one example:
```python
import logging

from airflow.decorators import dag, task
from airflow.models.param import Param
from airflow.operators.python import get_current_context

from common.tasks.general import trigger_another_dag
from common.tasks.teams import notify_failure
from common.services.etl import ETLService
from common.settings.car import *  # star import provides SOURCE, used below
from common.settings.dags import default_args
from common.settings.monitoring import UPDATE_RUNNING, WEEKLY
from common.settings.envs import START_DATE

logger = logging.getLogger("airflow.task")
etl_service = ETLService()


@dag(
    default_args=default_args,
    schedule_interval="@weekly",
    start_date=START_DATE,
    catchup=False,
    tags=["public"],
    params={
        "ignore_discrepancy": Param(False, type="boolean"),
        "emergency_mode": Param(None, type=["null", "string"]),
    },
)
def update_car_extract():
    @task(on_failure_callback=notify_failure)
    def main_run_attrs() -> dict:
        """Core function that creates the monitoring instance.

        :return: dict with data for monitoring this project
        """
        from common.models.car import DataModel

        context = get_current_context()
        return etl_service.start_monitoring(
            DataModel, context, UPDATE_RUNNING, WEEKLY
        )

    @task(on_failure_callback=notify_failure)
    def download_file_to_s3() -> str:
        """Core function to download files from the source directly to S3.

        It can be set to run in emergency mode via the DAG conf.

        :return: string with the S3 path to the raw data
        """
        from common.models.sema_mt_car import DataModel

        context = get_current_context()
        # Set the URL to download
        source_urls = {"zip": SOURCE}
        logger.info(f"The source URLs: {source_urls}")
        result_download = etl_service.download_file(
            context, DataModel, source_urls, use_raw=False
        )
        return result_download

    dag_conf = {
        "main_run_attrs": main_run_attrs(),
        "raw_data_zip_path": download_file_to_s3(),
    }
    trigger_another_dag(
        dag_conf,
        "trigger_transform_dag",
        "update_car_transform",
    )


update_car_extract()
```
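Note: `trigger_another_dag` is one of our internal helpers and is not shown here. A minimal sketch of what it presumably does, assuming it just wraps the stock `TriggerDagRunOperator` (argument order inferred from the call above; the real implementation may differ):

```python
# Hypothetical sketch of common.tasks.general.trigger_another_dag, assuming it
# simply wraps TriggerDagRunOperator; the actual helper is not shown in this report.
from airflow.operators.trigger_dagrun import TriggerDagRunOperator


def trigger_another_dag(conf: dict, task_id: str, trigger_dag_id: str) -> TriggerDagRunOperator:
    # Fires a DagRun of `trigger_dag_id`, passing `conf` through to the target DAG
    return TriggerDagRunOperator(
        task_id=task_id,
        trigger_dag_id=trigger_dag_id,
        conf=conf,
    )
```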
2 - Dockerfile
```dockerfile
FROM apache/airflow:2.9.3-python3.11

COPY requirements.txt /requirements.txt
RUN pip install --upgrade pip --trusted-host pypi.org --trusted-host files.pythonhosted.org
RUN pip install --no-cache-dir -r /requirements.txt --trusted-host pypi.org --trusted-host files.pythonhosted.org

USER root
RUN apt-get update && \
    apt-get install --allow-downgrades -y libpq5=15.6-0+deb12u1 libmariadb3=1:10.11.6-0+deb12u1
RUN apt-get install -y libgdal-dev \
    gdal-bin \
    gcc \
    g++
RUN sudo apt-get install unrar-free -y
RUN sudo pip install geopandas --trusted-host pypi.org --trusted-host files.pythonhosted.org
RUN sudo pip install --global-option=build_ext --global-option="-I/usr/include/gdal" GDAL==`gdal-config --version` --trusted-host pypi.org --trusted-host files.pythonhosted.org
RUN sudo pip install --no-cache-dir rasterio --trusted-host pypi.org --trusted-host files.pythonhosted.org
RUN apt-get clean
USER airflow
```
3 - docker-compose.yaml
```yaml
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  image: my-tag:latest
  # build: .
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session'
    AIRFLOW__WEBSERVER__SHOW_TRIGGER_FORM_IF_NO_PARAMS: 'true'
    AIRFLOW__WEBSERVER__EXPOSE_CONFIG: 'true'
    AIRFLOW__CORE__DEFAULT_TIMEZONE: 'America/Sao_Paulo'
    AIRFLOW__WEBSERVER__DAG_ORIENTATION: 'TB'
    AIRFLOW__LOGGING__COLORED_CONSOLE_LOG: 'true'
    AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD: 600
    # yamllint disable rule:line-length
    # Use simple http server on scheduler for health checks
    # See https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html#scheduler-health-check-server
    # yamllint enable rule:line-length
    AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: 'true'
    # WARNING: Use _PIP_ADDITIONAL_REQUIREMENTS option ONLY for quick checks
    # for other purpose (development, test and especially production usage) build/extend Airflow image.
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    # The following line can be used to set a custom config file, stored in the local config folder
    # If you want to use it, outcomment it and replace airflow.cfg with the name of your config file
    # AIRFLOW_CONFIG: '/opt/airflow/config/airflow.cfg'
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
    - ${AIRFLOW_PROJ_DIR:-.}/common:/opt/airflow/plugins/common
    - $HOME/.aws:/home/airflow/.aws
  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgis/postgis:13-3.4
    platform: linux/amd64
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
      start_period: 5s
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8974/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-triggerer:
    <<: *airflow-common
    command: triggerer
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    # yamllint disable rule:line-length
    command:
      - -c
      - |
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo "   https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins /sources/common
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins,common}
        exec /entrypoint airflow version
    # yamllint enable rule:line-length
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_MIGRATE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
      _PIP_ADDITIONAL_REQUIREMENTS: ''
    user: "0:0"
    volumes:
      - ${AIRFLOW_PROJ_DIR:-.}:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252
    command:
      - bash
      - -c
      - airflow

# You can enable flower by adding "--profile flower" option e.g. docker-compose --profile flower up
# or by explicitly targeted on the command line e.g. docker-compose up flower.
# See: https://docs.docker.com/compose/profiles/

volumes:
  postgres-db-volume:
```
4 - There are no specific conditions under which the DAGs go missing, though I do suspect it happens when I run `docker compose down` and `up` too frequently.
At the moment I don't have screenshots showing how the files go missing in the webserver, but they literally just disappear: from 16 DAGs, for example, I refresh the page (F5) and there are now 14 DAGs.
For context: this does not occur in our dev and production environments, only locally. Locally I usually have around 30 DAGs, and in production we have around 300+, with files reaching 2k lines of code.
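Next time it happens, a quick way to compare what the scheduler has actually parsed against what the UI shows (assuming the compose service names above):

```bash
# List every DAG the scheduler's parser currently knows about
docker compose exec airflow-scheduler airflow dags list

# Show DAG files that failed to import (a common cause of DAGs vanishing from the UI)
docker compose exec airflow-scheduler airflow dags list-import-errors
```

If the missing DAGs still show up here but not in the webserver, the problem is presumably on the serialized-DAG side rather than in parsing.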