potiuk commented on PR #29401: URL: https://github.com/apache/airflow/pull/29401#issuecomment-1423860810
> +1 -- Agreed with @jedcunningham . That is one of the things we said the user-community chart did that we would never do when we started building the official helm chart for Airflow (or so I remember :) )

+1 too. Managing pools this way is a bad idea for the reasons @jedcunningham nicely laid out.

> We had discussion on that. See https://github.com/apache/airflow/issues/18582

As @eladkal mentioned, this would require redefining the way our SQL queries are done in the scheduler. There is a certain complexity of Pools in a multi-scheduler context that makes it a bit complex (but not impossible, I think). I once assessed this might be hard. But I looked closely at how it works and I **think** we could possibly attempt to change how the Pools table is used. Either the code was different when I looked at it in #18582, or maybe I had not thought about this idea before, but let me explain the context:

* The Pool table is currently locked during scheduling as a critical section in `_executable_task_instances_to_queued`:

```
# Get the pool settings. We get a lock on the pool rows, treating this as a "critical section"
# Throws an exception if lock cannot be obtained, rather than blocking
pools = Pool.slots_stats(lock_rows=True, session=session)
```

* And the `slots_stats` method uses this query to lock the Pool table:

```
query = session.query(Pool.pool, Pool.slots)
```

What happens next is that the method builds a "PoolStats" dict and marks tasks as executable as long as they still have "free slots" (using the in-memory dict). Only then does it release the lock on Pools. This is in order to accommodate multiple schedulers: the critical section makes sure schedulers will not mark dag runs as executable when Pools would be over-subscribed.

So while the Pools table is indeed locked and used in the query, fundamentally I think we could take the "total" values from ENV variables rather than from the Pools table.
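To make the accounting concrete, here is a minimal, self-contained sketch (not Airflow's actual code) of what happens inside the critical section as described above: totals live in an in-memory dict (today built from the locked Pool rows), and candidate tasks are queued only while their pool still has free slots. The names `PoolStats` and `queue_executable` are illustrative, not Airflow's real API:

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    total: int     # configured slot count for the pool
    occupied: int  # slots already taken by running/queued tasks

    @property
    def open(self) -> int:
        return self.total - self.occupied


def queue_executable(candidates, stats):
    """Return task ids that fit into their pool's remaining slots.

    candidates: (task_id, pool_name) pairs in priority order.
    The in-memory dict is mutated as slots are claimed, mirroring how the
    scheduler consumes PoolStats while the Pool rows stay locked.
    """
    queued = []
    for task_id, pool in candidates:
        pool_stats = stats.get(pool)
        if pool_stats is None or pool_stats.open < 1:
            continue  # unknown or exhausted pool: skip this task
        pool_stats.occupied += 1
        queued.append(task_id)
    return queued


stats = {"default_pool": PoolStats(total=2, occupied=1)}
candidates = [("t1", "default_pool"), ("t2", "default_pool"), ("t3", "other")]
print(queue_executable(candidates, stats))  # only t1 fits the remaining slot
```

The point of the sketch: nothing in this loop cares *where* `stats[...].total` came from — the DB lock only guarantees that the totals and occupied counts stay consistent while the loop runs.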
It is somewhat coincidental that we use "totals" from the Pools table - it's convenient, and the code already has Pool objects returned by the query to build PoolStats, but it's not **really** necessary. The main use of the Pool table here is that the whole table is locked while the critical section happens.

Locking the Pool table also has a nice side effect: when we use the UI or CLI to modify pools, we won't change the pool size while the scheduler keeps the table locked in the critical section - so a pool size update will never happen while the scheduler is in it. But with env variables it is impossible to change pool sizes without restarting Airflow, so this is not really an argument if we use them. We could even have a separate lock for this critical section in `_executable_task_instances_to_queued` - it does not have to be a Pools table lock.

There seems to be no referential integrity between Pools and any other table, so I think we would just have to modify the `slots_stats` function to retrieve totals from elsewhere (env vars) and things would work. The `pool` column in TaskInstance is just a string (part of an index, but not part of referential integrity with the Pool table):

```
pool = Column(String(256), nullable=False)
```

I still do not like the "multiple sources of truth" in this case. But there is potentially an option to disable pool management in the UI entirely and replace it with pool counts managed entirely in env vars only.

Also, I think changing the approach in this area is something we want anyway, so we could kill two birds with one stone - see https://github.com/apache/airflow/issues/29416: our pool accounting does not currently take deferred tasks into account, but there are cases where it would be desired.
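A rough sketch of the "totals from env vars" idea: pool sizes are parsed from environment variables instead of being read from the Pool table. Note that the `AIRFLOW__POOLS__<NAME>` naming below is purely hypothetical for illustration - no such convention exists in Airflow today, and the actual naming would be decided in the proposal itself:

```python
import os

# Hypothetical prefix -- the real convention, if any, would come from the proposal.
PREFIX = "AIRFLOW__POOLS__"


def pool_totals_from_env(environ):
    """Build {pool_name: slot_count} from env vars like AIRFLOW__POOLS__MY_POOL=16.

    This would replace the `session.query(Pool.pool, Pool.slots)` source of
    totals in a slots_stats-like function; the occupied counts would still
    come from the TaskInstance table as they do today.
    """
    totals = {}
    for key, value in environ.items():
        if key.startswith(PREFIX):
            pool_name = key[len(PREFIX):].lower()
            totals[pool_name] = int(value)
    return totals


os.environ["AIRFLOW__POOLS__DEFAULT_POOL"] = "128"
print(pool_totals_from_env(dict(os.environ)))
```

Because the `pool` column on TaskInstance is a plain string with no foreign key, swapping the source of totals like this would not require any schema change - only the locking strategy for the critical section would need to be rethought (e.g. a dedicated lock instead of locking the Pool rows).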
