GitHub user Urus1201 added a comment to the discussion: Dag processor behavior at considered scale
Great questions: this is exactly the use case `random_seeded_by_host` was designed for.

## Can you run multiple DAG Processor replicas?

**Yes.** In Airflow 3, the DAG Processor is a standalone component that can be scaled horizontally. Run as many replicas as you like:

```yaml
# In a Kubernetes deployment, for example:
replicas: 2
```

## Does each replica process the same file twice? Is there locking?

**No double-processing by design** when you use `file_parsing_sort_mode: random_seeded_by_host`. Each DAG Processor replica seeds its random shuffle with its own hostname, so the file processing order is different on every host. Over time, every file is processed by *some* replica, but replicas rarely race each other for the same file at the same instant. That is statistical distribution, not hard locking.

There is **no pessimistic file-level lock**: if two replicas do parse the same file at the same time, both writes reach the metadata database and the last writer wins. This is safe because parsing is read-only on the filesystem and the resulting DagModel upsert is idempotent, so the duplicate work is wasted but harmless.

## Advice for your specific setup

Your configuration looks reasonable. A few tuning notes:

### 1. Two replicas is the sweet spot for Azure Files latency

Azure Files NFS/SMB shares have higher metadata latency than local NVMe. With thousands of files and a single replica at `parsing_processes: 16`, you are likely I/O-bound on directory traversal. Adding a second replica doubles effective parse throughput without any coordination overhead.

### 2. Tune `min_file_process_interval` relative to `dag_file_processor_timeout`

Your current settings:

```yaml
min_file_process_interval: 600   # re-parse at most once per 10 min
dag_file_processor_timeout: 600  # kill parser after 10 min
```

These are back-to-back with no buffer. If a slow DAG file hits the timeout (600 s), it immediately becomes eligible for re-parse.
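To make the zero-buffer problem concrete, here is a trivial sketch in plain Python, using the numbers from the config above and the simplifying assumption (implied by the note above) that the interval is counted from the start of a parse attempt:

```python
# Hypothetical values copied from the config above.
min_file_process_interval = 600   # seconds between parse attempts
dag_file_processor_timeout = 600  # seconds before a parse is killed

# Worst case: a file that always runs into the timeout. The idle time
# between "parse killed" and "eligible for re-parse" is:
idle_buffer = min_file_process_interval - dag_file_processor_timeout
print(idle_buffer)  # 0 -> a chronically slow file keeps a parser slot busy
```

With a zero buffer, one pathological file can permanently occupy one of your 16 parsing processes.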
Consider:

```yaml
min_file_process_interval: 900   # 15 min — gives breathing room
dag_file_processor_timeout: 600  # keep at 10 min
stale_dag_threshold: 600         # raise to match the parser timeout
```

### 3. Separate slow-parsing DAGs if possible

With "some DAGs parse faster, some slower", the random distribution means several slow DAGs can land on the same replica, creating a parse queue. If you can identify the slow files, moving them into a separate DAG bundle with a dedicated processor (Airflow 3 supports multiple bundles) will reduce tail latency.

### 4. Memory consideration

16 parsing processes on an 8 vCPU / 16 GiB node leaves each process roughly 1 GiB of RAM. If your DAGs import heavy libraries (pandas, sklearn, etc.) in top-level code, you may see OOM kills. Try:

```yaml
parsing_processes: 8  # one per vCPU, avoids memory pressure
```

## Summary

| Question | Answer |
|---|---|
| Multiple replicas supported? | ✅ Yes |
| Same file processed twice? | Usually no: `random_seeded_by_host` distributes naturally |
| Hard locking? | No: last-write-wins is safe (idempotent) |
| Recommended replicas for your setup | 2 |
| Main risk | Memory pressure with 16 parse processes on 16 GiB |

GitHub link: https://github.com/apache/airflow/discussions/64944#discussioncomment-16503553
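P.S. For the curious, the per-host ordering behavior from the first section can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Airflow's actual implementation; the hostnames are made up:

```python
import random

FILES = [f"dag_{i:03}.py" for i in range(8)]

def parse_order(files, hostname):
    # Shuffle with a hostname-derived seed: the order is stable on a
    # given host and (almost always) different across hosts.
    order = list(files)
    random.Random(hostname).shuffle(order)
    return order

# Each replica walks the same set of files in its own deterministic order,
# so over a full cycle every file is visited, with no coordination needed.
print(parse_order(FILES, "dag-processor-0"))
print(parse_order(FILES, "dag-processor-1"))
```

Seeding `random.Random` with a string is deterministic across Python runs, which is what makes the per-host order stable rather than re-randomized on every scan.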
