potiuk edited a comment on issue #9755:
URL: https://github.com/apache/airflow/issues/9755#issuecomment-695775260


   > You can define multiple pools for one task.
   
   But isn't that an abuse of pools? Pools are a feature taken from Celery, and now they are also used by the CeleryKubernetesExecutor to distinguish "big" tasks that you send to Kubernetes from "small" tasks that you send to Celery. 
   
   Surely we can use them for a hundred other things, but should we? Is it a discoverable use case? Are we going to extend the documentation of pools to cover that case as well? The fact that we **can** do something does not always mean that we **should** do it.
   
   > That way the DAG itself left untouched
   
   I think the most important "feature" of the proposal (as @eladkal also wrote) is that you do not have to write your DAG in a special way if you want to be able to drain connections. In the "pools" case you would have to remember to add a specially crafted pool to every task that uses a given connection (one pool per connection!) and you would have to keep a mapping somewhere: this pool should be drained to 0 slots when you want to drain that connection. That's probably quite far from user-friendly.
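   To make the objection concrete, the pools-based workaround described above would boil down to bookkeeping like the following. This is a hypothetical plain-Python sketch, not real Airflow API: the mapping, helper names, and the in-memory `slots` dict (which in real Airflow would live in the metadata DB) are all invented for illustration.

```python
# Hand-maintained mapping the user must keep somewhere:
# connection id -> the pool "guarding" that connection.
CONNECTION_POOLS = {
    "my_postgres_conn": "pool_my_postgres_conn",
    "my_http_conn": "pool_my_http_conn",
}

def pool_for(conn_id: str) -> str:
    """Look up the pool every task using this connection must declare."""
    return CONNECTION_POOLS[conn_id]

def drain(conn_id: str, pool_slots: dict) -> None:
    """'Drain' a connection indirectly, by setting its pool to 0 slots."""
    pool_slots[pool_for(conn_id)] = 0

# Simulated pool sizes (in real Airflow these are pool records in the DB).
slots = {"pool_my_postgres_conn": 5, "pool_my_http_conn": 5}
drain("my_postgres_conn", slots)
print(slots["pool_my_postgres_conn"])  # 0 - no task using it can run now
```

   Note that every DAG author must remember both halves of this: declaring the right pool on every task, and the mapping used at drain time.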
   
   I like the idea that when you want to drain a connection, you do something (disable? drain?) to ..... the connection itself, and not to some other abstract pool that you have to remember to add when you write the DAG.
   
   
   >  if it can't access the connection it will act in a sensor like way 
(release worker and try reschedule later).
   
   I also think this kind of "reschedule" mode is not the best idea, because rather than living at the DAG level, it would require adding similar logic to pretty much every task that uses the connection (check the connection; if it is not available, reschedule). Doing it at scheduling time might be a far more efficient way of doing it, without sending the task to the executor at all. And in the case of the Kubernetes Executor, without the overhead of creating a whole new pod just to check the connection and shut down with a reschedule.
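   For comparison, the task-level "sensor-like" behaviour would look roughly like this. A hypothetical plain-Python sketch with invented names; a real implementation would be an Airflow sensor in `reschedule` mode raising `AirflowRescheduleException`:

```python
def connection_available(conn_id: str, drained_conns: set) -> bool:
    """Poke step: a task may only proceed if its connection is not drained."""
    return conn_id not in drained_conns

def run_or_reschedule(conn_id: str, drained_conns: set) -> str:
    """Mimics sensor 'reschedule' mode: either run now, or release the
    worker slot and ask to be scheduled again later."""
    if connection_available(conn_id, drained_conns):
        return "run"
    return "reschedule"

drained = {"my_postgres_conn"}  # connections an operator wants to drain
print(run_or_reschedule("my_postgres_conn", drained))  # reschedule
print(run_or_reschedule("my_http_conn", drained))      # run
```

   Every task using a connection would need this check, and each "poke" still costs a worker slot (or, with the Kubernetes Executor, a whole pod) before the reschedule happens.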
   
   I think this one also has somewhat further-reaching implications if we implement it. In order to do that, we must have information about the connections used by a task. This is not very difficult; it only requires the task to know which hooks are used by it (assuming that each hook has a single connection). While we do not have this information now AFAIK, I think it would be worthwhile to consider adding it, because it can lead to much nicer lineage information tracking.
   
   It can be an interesting building block and a base for implementing better "lineage" information in the long term. If tasks know which connections they are using, and for each connection they register the "resource" they are using, that's an interesting way to auto-discover lineage information - i.e. how data passes through the whole DAG.
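   The auto-discovery idea could be sketched like this (hypothetical structures and names, invented for illustration): if each task registers the `(connection, resource)` pairs it touches, resource-level lineage falls out of the DAG's own task ordering by simple bookkeeping:

```python
# Hypothetical registry: task id -> set of (conn_id, resource) pairs it uses.
usage = {
    "extract": {("my_postgres_conn", "public.orders")},
    "load": {("my_s3_conn", "s3://bucket/orders/")},
}

# Task ordering declared in the DAG (extract >> load).
dag_edges = [("extract", "load")]

def lineage_edges(usage, dag_edges):
    """Derive resource-level lineage: data flows from the resources used by
    an upstream task to the resources used by its downstream task."""
    edges = []
    for up, down in dag_edges:
        for _, src in usage.get(up, ()):
            for _, dst in usage.get(down, ()):
                edges.append((src, dst))
    return edges

print(lineage_edges(usage, dag_edges))
# [('public.orders', 's3://bucket/orders/')]
```

   The point is that none of this requires the DAG author to declare lineage by hand - it is derived from the hooks/connections the tasks already use.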
   
   For me this one looks like a rather interesting candidate feature for Airflow 2.1.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

