potiuk edited a comment on issue #9755: URL: https://github.com/apache/airflow/issues/9755#issuecomment-695775260
> You can define multiple pools for one task.

But is that not an abuse of pools? Pools are a feature taken from Celery, and they are now also used by the CeleryKubernetesExecutor to distinguish "big" tasks that you want to send to Kubernetes from "small" tasks that you want to send to Celery. Surely we can use them for a hundred other things, but should we? Is it a discoverable use case? Are we going to extend the pools documentation to cover that case as well? The fact that we **can** do something does not always mean that we **should** do it.

> That way the DAG itself left untouched

I think the most important "feature" of the proposal (as @eladkal also wrote) is that you do not have to write your DAG in a special way if you would like to drain connections. In the "pools" case you would have to remember to add specially crafted pools to every task that uses some connection (one pool per connection!), and you would have to keep a mapping somewhere: this pool should be drained to 0 when you want to drain that connection. That is probably quite far from being user-friendly. I like the idea that when you want to drain a connection, you do something (disable? drain?) about ..... the connection itself, and not about some other abstract pool that you had to remember to add when you wrote the DAG.

> if it can't access the connection it will act in a sensor like way (release worker and try reschedule later).

I also think this kind of "reschedule" mode is not the best idea, because rather than working at the DAG level, it would require adding similar logic to pretty much every task using the connection (check the connection; if it is not available, reschedule). Doing the check at scheduling time would be a far more efficient way of doing it, without sending the task to the executor at all. And in the case of the Kubernetes Executor, it would avoid the overhead of creating a whole new pod just to check the connection and shut down with a reschedule. I think this one also has somewhat further-reaching implications if we implement it.
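To make the burden of the pools workaround concrete, here is a minimal, purely hypothetical sketch (plain Python, not Airflow API; `CONN_TO_POOL`, `pool_for`, and `drain` are illustrative names) of the bookkeeping a user would have to maintain by hand: one dedicated pool per connection, plus a mapping to know which pool to zero out when draining a connection.

```python
# Hypothetical bookkeeping for the "one pool per connection" workaround.
# None of these names come from Airflow; they only illustrate the mapping
# a user would have to maintain themselves.
CONN_TO_POOL = {
    "my_postgres": "pool_conn_my_postgres",
    "my_s3": "pool_conn_my_s3",
}

def pool_for(conn_id: str) -> str:
    """The pool that every task using conn_id must remember to declare."""
    return CONN_TO_POOL[conn_id]

def drain(conn_id: str, set_pool_slots) -> None:
    """Drain a connection indirectly, by zeroing its dedicated pool.

    set_pool_slots is a stand-in for whatever mechanism (UI, CLI, API)
    actually updates the pool's slot count.
    """
    set_pool_slots(pool_for(conn_id), 0)
```

The indirection is the point: the user acts on an abstract pool and has to remember the mapping, rather than acting on the connection itself.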
In order to do that, we must have information about the connections used by the task. This is not very difficult: it only requires the task to know which hooks are being used by it (assuming that each hook has a single connection). While we do not have this information now, AFAIK, I think it would be worthwhile to consider adding it, because it can lead to much nicer lineage tracking. It can be an interesting building block and a base for implementing better "lineage" information in the long term. If tasks know which connections they are using, and for each connection they register the "resource" they are using, that is an interesting way to auto-discover lineage information, i.e. how data passes through the whole DAG.

For me this one looks like a rather interesting candidate for a feature for Airflow 2.1.
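To illustrate why registered resources would make lineage "fall out" almost for free, here is a minimal sketch (plain Python, not Airflow code; `lineage_edges` and the `reads`/`writes` shape are assumptions for illustration): if each task declared the resources it reads and writes, the data-flow edges of the DAG could be derived by joining tasks on shared resources.

```python
from collections import defaultdict

def lineage_edges(task_resources):
    """Derive data-flow edges from per-task resource declarations.

    task_resources maps task_id -> {"reads": set, "writes": set}.
    Emits (upstream, downstream, resource) wherever one task writes a
    resource that another task reads -- i.e. auto-discovered lineage.
    """
    writers = defaultdict(set)
    for task, res in task_resources.items():
        for resource in res["writes"]:
            writers[resource].add(task)

    edges = []
    for task, res in task_resources.items():
        for resource in res["reads"]:
            for writer in writers[resource]:
                if writer != task:
                    edges.append((writer, task, resource))
    return edges
```

With each hook mapping to a single connection, and each connection usage registering a resource, declarations like these could be collected automatically rather than written by hand.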
