potiuk edited a comment on issue #9755:
URL: https://github.com/apache/airflow/issues/9755#issuecomment-695775260


   > You can define multiple pools for one task.
   
   But isn't that an abuse of pools? Pools are a feature taken from Celery, and 
now they are also used by the CeleryKubernetesExecutor to distinguish "big" tasks 
that you can send to Kubernetes from "small" tasks that you want to send to 
Celery. 
   
   Surely we can use pools for a hundred other things, but should we? Is it a 
discoverable use case? Are we going to extend the documentation of pools to 
include that case as well? The fact that we **can** do something does not 
always mean that we **should** do it.
   
   > That way the DAG itself left untouched
   
   I think the most important "feature" of the proposal (as @eladkal also 
wrote) is that you do not have to write your DAG in a special way if you want 
to be able to drain connections. In the "pools" case you would have to remember 
to add specially crafted pools to every task that uses a given connection (one 
pool per connection!) and you would have to keep a mapping somewhere: this pool 
should be drained to 0 when you want to drain that connection. That is probably 
quite far from user-friendly.
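   To make that awkwardness concrete, here is a minimal pure-Python sketch of the 
bookkeeping the "pools" workaround would force on the user. All names here (the 
pool names, the mapping dict, the `drain_connection` helper) are hypothetical 
illustrations of the idea, not an existing Airflow API:

```python
# Sketch of the bookkeeping the "pools" workaround forces on the user.
# All names are hypothetical illustrations, not an existing Airflow API.

# One specially crafted pool per connection -- the user must maintain this
# mapping themselves, somewhere outside the DAG definition.
CONN_TO_POOL = {
    "my_postgres_conn": "pool_my_postgres_conn",
    "my_s3_conn": "pool_my_s3_conn",
}

# In-memory stand-in for Airflow's pool table (pool name -> slot count).
pools = {pool: 10 for pool in CONN_TO_POOL.values()}

def drain_connection(conn_id: str) -> None:
    """Drain a connection by setting its dedicated pool to 0 slots.

    Note the indirection this comment argues against: the user has to
    remember that draining a *connection* really means editing some *pool*.
    """
    pools[CONN_TO_POOL[conn_id]] = 0

drain_connection("my_postgres_conn")
```

   The point of the sketch is the extra moving part: the `CONN_TO_POOL` mapping 
has to live somewhere and stay in sync with every DAG that uses the connection.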
   
   I like the idea that when you want to drain a connection, you do 
something (disable? drain?) to ..... the connection itself, and not to some 
other abstract pool that you have to remember to add when you write the DAG.
   
   
   >  if it can't access the connection it will act in a sensor like the way 
(release worker and try to reschedule later).
   
   I also think this kind of "reschedule" mode is not the best idea, because 
rather than working at the DAG level, it would require adding similar logic to 
pretty much every task that uses the connection (check the connection; if not 
available, reschedule). Doing the check at scheduling time might be a far more 
efficient way of achieving the same thing, without sending the task to the 
executor at all. And in the case of the Kubernetes Executor, it avoids the 
overhead of creating a whole new pod just to check the connection and shut down 
with a reschedule.
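   A scheduler-side check could look roughly like the following toy simulation. 
The `disabled` flag on the connection, the `conn_ids` declaration on the task, 
and the `schedule` loop are all hypothetical stand-ins for the proposal, not 
real Airflow internals:

```python
# Toy simulation of checking connection availability at scheduling time,
# instead of inside every task (or inside a freshly spawned pod).
# All names are hypothetical illustrations, not real Airflow internals.
from dataclasses import dataclass, field

@dataclass
class Connection:
    conn_id: str
    disabled: bool = False  # hypothetical "drained" flag on the connection

@dataclass
class Task:
    task_id: str
    conn_ids: list = field(default_factory=list)  # connections the task declares

def schedule(tasks, connections):
    """Queue only tasks whose declared connections are all available.

    Tasks touching a drained connection are deferred here, in the
    scheduler, so no executor slot or pod is ever spent on them.
    """
    conns = {c.conn_id: c for c in connections}
    queued, deferred = [], []
    for t in tasks:
        if any(conns[cid].disabled for cid in t.conn_ids):
            deferred.append(t.task_id)
        else:
            queued.append(t.task_id)
    return queued, deferred
```

   With a drained `pg` connection, `schedule([Task("load", ["pg"]), 
Task("report", [])], [Connection("pg", disabled=True)])` would defer `load` and 
queue only `report`, without either task ever reaching a worker.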
   
   I think this one also has somewhat further-reaching implications if we 
implement it. In order to do it, we must have information about the connections 
used by each task. This is not very difficult; it only requires the task to 
know which hooks it uses (assuming that each hook has a single connection). 
While we do not have this information now AFAIK, I think it would be worthwhile 
to consider adding it, because it can lead to much nicer lineage tracking.
   
   It can be an interesting building block and a basis for implementing better 
"lineage" information in the long term. If tasks know which connections they 
use, and for each connection they register the "resource" they access, that is 
an interesting way to auto-discover lineage information - i.e. how data passes 
through the whole DAG. 
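   The auto-discovery idea can be sketched in a few lines: if each task 
registers, per connection, the resources it reads and writes, task-to-task 
lineage edges fall out by linking writers of a resource to its readers. The 
`lineage_edges` function, the registration shape, and the example resources 
below are all hypothetical, not an existing Airflow feature:

```python
def lineage_edges(registrations):
    """Derive task-to-task lineage from per-connection resource registrations.

    registrations: task_id -> list of (mode, resource) pairs, where mode is
    "reads" or "writes". A writer of a resource is linked to every reader
    of the same resource -- i.e. how data passes through the DAG.
    """
    writers, readers = {}, {}
    for task, pairs in registrations.items():
        for mode, resource in pairs:
            bucket = writers if mode == "writes" else readers
            bucket.setdefault(resource, []).append(task)
    return sorted(
        (writer, reader)
        for resource, ws in writers.items()
        for writer in ws
        for reader in readers.get(resource, [])
    )

# Hypothetical registrations collected from the hooks used by each task.
regs = {
    "extract": [("reads", "postgres://db/users"), ("writes", "s3://bucket/raw")],
    "transform": [("reads", "s3://bucket/raw"), ("writes", "s3://bucket/clean")],
    "load": [("reads", "s3://bucket/clean")],
}
```

   Here `lineage_edges(regs)` recovers extract -> transform -> load purely from 
the registered resources, with no lineage declared in the DAG itself.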
   
   For me this one looks like a rather interesting candidate feature for 
Airflow 2.1.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
