GitHub user Urus1201 added a comment to the discussion: Dag processor behavior 
at considered scale

Great questions — this is exactly the use case `random_seeded_by_host` was 
designed for.

## Can you run multiple DAG Processor replicas?

**Yes.** In Airflow 3, the DAG Processor is a standalone component that can be 
scaled horizontally. Run as many replicas as you like:

```yaml
# In a Kubernetes deployment, for example:
replicas: 2
```

## Does each replica process the same file twice? Is there locking?

**There is no hard locking, and occasional overlap is harmless** when you use
`file_parsing_sort_mode: random_seeded_by_host`.

Each DAG Processor replica seeds its file shuffle with its own hostname, so the
processing order differs on every host. Every replica still scans the full DAG
folder — over time each file is parsed by each replica — but because the orders
diverge, two replicas rarely reach the same file at the same instant. That is a
statistical property, not a lock.

If two replicas do happen to parse the same file concurrently, both results
reach the metastore and the last writer wins. This is safe because parsing is
read-only on the filesystem and the resulting DagModel upsert is idempotent.

## Advice for your specific setup

Your configuration looks reasonable. A few tuning notes:

### 1. Two replicas is the sweet spot for Azure Files latency

Azure Files NFS/SMB shares have much higher metadata latency than local NVMe. 
With thousands of files and a single replica at `parsing_processes: 16`, you 
are likely I/O-bound on directory traversal. A second replica roughly halves 
the time between successive parses of any given file, with no coordination 
overhead to pay.

### 2. Tune `min_file_process_interval` relative to `dag_file_processor_timeout`

Your current settings:
```yaml
min_file_process_interval: 600   # re-parse at most once per 10 min
dag_file_processor_timeout: 600  # kill parser after 10 min
```

These are back-to-back with no buffer. If a slow DAG file hits the timeout 
(600s), it immediately becomes eligible for re-parse. Consider:
```yaml
min_file_process_interval: 900   # 15 min — gives breathing room
dag_file_processor_timeout: 600  # keep at 10 min
stale_dag_threshold: 600         # increase to match
```
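To make the relationship explicit, here is a small sanity check you could run against your values (plain arithmetic; the parameter names just mirror the `airflow.cfg` keys and the 300 s buffer is my suggestion, not an Airflow default):

```python
# Hypothetical helper mirroring the config keys above -- not an Airflow API.
def has_reparse_buffer(min_file_process_interval: int,
                       dag_file_processor_timeout: int,
                       buffer: int = 300) -> bool:
    """True if a timed-out parse cannot immediately become eligible again."""
    return min_file_process_interval >= dag_file_processor_timeout + buffer

current = has_reparse_buffer(600, 600)    # False: no headroom at all
suggested = has_reparse_buffer(900, 600)  # True: five minutes of headroom
```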

### 3. Separate slow-parsing DAGs if possible

With "some DAGs parse faster, some slower", the random distribution means a 
slow DAG might land on the same replica as another slow DAG, creating a parse 
queue. If you can identify the slow files, moving them to a separate DAG bundle 
with a dedicated processor (Airflow 3 supports multiple bundles) will reduce 
tail latency.

### 4. Memory consideration

16 parsing processes on an 8 vCPU / 16 GiB node leaves each process less than 
~1 GiB of RAM once the OS and the processor's own overhead are accounted for. 
If your DAGs with top-level code import heavy libraries (pandas, sklearn, 
etc.), you may see OOM kills. Try:
```yaml
parsing_processes: 8  # one per vCPU, avoids memory pressure
```
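As back-of-the-envelope arithmetic (the 2 GiB reservation for the OS and non-parsing overhead is an assumption — measure on your node):

```python
# Rough per-process memory budget; the 2 GiB reservation is an assumption.
NODE_RAM_GIB = 16
RESERVED_GIB = 2  # OS + non-parsing overhead (adjust after measuring)

def per_process_budget_gib(parsing_processes: int) -> float:
    return (NODE_RAM_GIB - RESERVED_GIB) / parsing_processes

budget_16 = per_process_budget_gib(16)  # 0.875 GiB: tight for heavy imports
budget_8 = per_process_budget_gib(8)    # 1.75 GiB: comfortable headroom
```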

## Summary

| Question | Answer |
|---|---|
| Multiple replicas supported? | ✅ Yes |
| Same file processed twice? | Each replica parses every file over time, but `random_seeded_by_host` makes simultaneous parses rare |
| Hard locking? | No — last-write-wins is safe (idempotent) |
| Recommended replicas for your setup | 2 |
| Main risk | Memory pressure with 16 parse processes on 16 GiB |

GitHub link: 
https://github.com/apache/airflow/discussions/64944#discussioncomment-16503553
