potiuk commented on PR #41870:
URL: https://github.com/apache/airflow/pull/41870#issuecomment-2323080871

   It's an interesting one. And yeah - nice analysis @Kytha!
   
   Looks like the original code was supposed to handle "quick" tasks - but it 
did not take into account that - apparently - in order to check if a result is 
available, Celery will make a DB call for EVERY async result - no matter 
what state the task is in - whether it is submitted, running, or completed. 
That's a very nasty side effect if that's confirmed.
   
   But I'd be rather surprised to see such a "huge" inefficiency in this check - I 
imagine it would have been found already if it were a "general" problem. 
   
   @Kytha - question - was your py-spy testing done with "real" tasks or with ones 
that were doing nothing (i.e. pass-only tasks or similar)? 
   
   One reason why you see such bad numbers might be that you have very quick 
(`pass`) Celery tasks and some of them manage to complete very quickly (but I 
do not know Celery internals).
   
   Calling the DB for every single task would be a rather inefficient way of 
checking whether a task has completed. Intuitively, I would imagine that Celery should 
**not** call the DB if the task has not completed yet (but maybe this is how it is 
implemented and there is no other side-communication to check it). I'd imagine 
you need to retrieve the result from the DB when the task has completed, because of 
persistence and recovery, but if the task is still "running" - I hardly see the 
need to reach out to the result backend (I imagine Celery could check in the 
queue whether the task was picked up and acknowledged way faster - without reaching 
out to the backend). 
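   
   To make the concern concrete, here is a toy model of the suspected behaviour. 
It is only a sketch and not Celery's actual code: `CountingBackend`, `get_state` 
and `poll_all` are hypothetical names, and the assumption is simply that every 
state check costs one round trip to the result backend, whatever state the task 
is in.

```python
from dataclasses import dataclass, field


@dataclass
class CountingBackend:
    """Hypothetical stand-in for a DB result backend; it only counts lookups."""

    states: dict = field(default_factory=dict)
    queries: int = 0

    def get_state(self, task_id: str) -> str:
        # Assumption: each lookup is one DB round trip, regardless of task state.
        self.queries += 1
        return self.states.get(task_id, "PENDING")


def poll_all(backend: CountingBackend, task_ids: list) -> dict:
    """Naive check: ask the backend about every task, finished or not."""
    return {task_id: backend.get_state(task_id) for task_id in task_ids}


backend = CountingBackend(states={"t1": "SUCCESS", "t2": "STARTED"})
task_ids = ["t1", "t2", "t3"]  # "t3" has no stored result yet
result = poll_all(backend, task_ids)
print(backend.queries)  # one lookup per task, including unfinished ones
```

   In this model the number of backend lookups grows with the number of tasks 
checked, not with the number of tasks that actually finished, which is the cost 
that would show up in the profile if the hypothesis holds.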
   
   One reason why you would see such a big "percentage" of time there would be that you 
made the tests with "pass-only" Celery tasks and quite a large number of those 
tasks manage to complete before the check happens (but this is just a 
hypothesis).
   
   Can you elaborate on that?
   
   
   


