potiuk commented on PR #41870: URL: https://github.com/apache/airflow/pull/41870#issuecomment-2323080871
It's an interesting one. And yeah - nice analysis @Kytha! Looks like the original code was supposed to handle "quick" tasks, but it did not take into account that - apparently - in order to check if a result is available, celery will indeed make a DB call for EVERY async result, no matter what state the task is in: submitted, running, or completed. That's a very nasty side effect, if that's confirmed.

But I'd be rather surprised to see such a "huge" inefficiency in this check - I imagine it would have been found already if that was a "general" problem.

@Kytha - question: was your py-spy testing done with "real" tasks or with ones that were doing nothing (i.e. pass-only tasks or similar)? One reason why you might see such bad numbers is that you have very quick (`pass`) celery tasks and some of them manage to complete very quickly (but I do not know celery internals).

Calling the DB for every single task would be a rather inefficient way of checking whether a task completed. Intuitively, I would imagine that celery should **not** call the DB if the task has not completed yet (but maybe this is how it is implemented and there is no other side-communication to check it). I'd imagine you need to retrieve the result from the DB when the task has completed, because of persistence and recovery, but if the task is still "running" I hardly see the need to reach out to the result backend (I imagine celery could check in the queue whether the task was picked up and acknowledged much faster, without reaching out to the backend).

So one reason why you would see such a big "percentage" of time there is that you made the tests with "pass-only" celery tasks and quite a big number of those tasks manage to complete before the check happens (but this is just a hypothesis). Can you elaborate on that?
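To make the concern concrete, here is a rough sketch of the cost pattern being discussed. It is a toy model only: `CountingBackend`, `ResultHandle` and `count_lookups` are hypothetical names invented for illustration, not celery or Airflow APIs, and the model simply assumes what is still to be confirmed, namely that every state check is one backend round trip regardless of the task's state.

```python
from dataclasses import dataclass, field

# States treated as "finished" in this toy model (an assumption for the sketch).
READY_STATES = {"SUCCESS", "FAILURE"}


@dataclass
class CountingBackend:
    """Hypothetical stand-in for a result backend; counts every lookup."""
    states: dict = field(default_factory=dict)
    lookups: int = 0

    def get_state(self, task_id):
        self.lookups += 1  # each call models one DB round trip
        return self.states.get(task_id, "PENDING")


@dataclass
class ResultHandle:
    """Hypothetical handle, loosely modelled on an async result object."""
    task_id: str
    backend: CountingBackend

    @property
    def state(self):
        # Assumption under test: the backend is queried on every access,
        # whether the task is pending, running, or finished.
        return self.backend.get_state(self.task_id)


def count_lookups(n_tasks, n_finished):
    """Poll every handle once; return (backend lookups, finished tasks seen)."""
    backend = CountingBackend()
    for i in range(n_finished):
        backend.states[f"task-{i}"] = "SUCCESS"
    handles = [ResultHandle(f"task-{i}", backend) for i in range(n_tasks)]
    finished = sum(1 for h in handles if h.state in READY_STATES)
    return backend.lookups, finished


if __name__ == "__main__":
    # 1000 submitted tasks, only 10 of them already finished.
    print(count_lookups(n_tasks=1000, n_finished=10))
```

If the assumption holds, the number of lookups scales with the number of submitted tasks rather than with the number that actually finished, which is why the share of very quick (`pass`-only) tasks in the test matters for interpreting the profile.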
