arjav1528 opened a new pull request, #60596:
URL: https://github.com/apache/airflow/pull/60596

   **Closes:** #60261
   
   ## Summary
   
   This PR fixes the intermittent `405 Method Not Allowed` error that edge 
workers encounter during startup in high-concurrency environments (e.g., 8 API 
servers × 64 worker processes).
   
   ## Root Cause
   
   The edge executor plugin was acquiring a global database lock 
(`DBLocks.MIGRATIONS`) **at class definition time** (plugin import) to create 
tables. When many API server processes started simultaneously, they all 
competed for the same lock, causing:
   
   1. `psycopg2.errors.LockNotAvailable` errors on lock timeout
   2. Plugin import failures → edge worker endpoints not registered
   3. Edge workers receiving `405 Method Not Allowed` for 
`/edge_worker/v1/worker/`
   
   ## Solution
   - Moved table creation from plugin import time to a **FastAPI startup event**
   - Plugin import no longer blocks on database operations
   - Added retry mechanism (5 attempts) for lock acquisition
   - Exponential backoff with jitter to avoid thundering herd
   - Proper logging for debugging
   - Check if tables exist **before** attempting to acquire lock
   - Skip lock entirely during normal operation (tables already exist)
   - Double-check pattern after lock acquisition to handle race conditions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to