repl-chris commented on PR #23432: URL: https://github.com/apache/airflow/pull/23432#issuecomment-1120081308
> For that reason I think I'd like you to look at If we can see if the taskmeta is _actually_ lost, to separate "lost" from "just in a queue behind n long running tasks."

@ashb Yes, this would be great, I agree. I'm far from a Celery expert, but as far as I can tell Celery does not provide any capability to do this. Any implementation would need to be broker-specific and directly access Celery internals, reading pending work items from the queue directly. This is not possible on SQS or RabbitMQ afaik, but I've only witnessed this problem on Redis, so maybe the fix only needs to work on Redis.

Even on Redis the solution isn't pretty though. The Redis queue in Celery is implemented using a Redis "list" data type, which internally is a linked list. So there's no great way to check if it contains a given task: we need to scan the entire list, O(n), to check if a single task is in there. Redis does provide an `LPOS` command which could do this scan for us, but it's still O(n), and it's only in newer versions of Redis (>=6.0.6). The list contains serialized JSON documents for the work items, so to use `LPOS` we'd need to synthetically generate a character-perfect copy of that JSON document, which uses an undocumented format.

I think in theory it may be possible to use Lua scripting to do this scan. I'm extremely open to other ideas, but I think the only way we could realistically make this work decently is to grab our oldest known "PENDING" task_id, send a Lua script to Redis which scans the list doing a `string.find(list_element, '"task_id"="our-task-id"')` sorta thing, and then hope Celery doesn't change how it works 😛

This whole direction sounds like a pretty bad idea IMHO, but I could go in this direction if we actually want to go there. Is rescheduling a "just-in-a-long-queue" task really that bad?
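
For illustration only, the Lua-scan idea could look roughly like the sketch below. It is not code from this PR. It assumes `client` is a redis-py client (its `eval(script, numkeys, *keys_and_args)` method is the only call used), that `queue_name` is the Redis key of the list backing the Celery queue, and that the Celery task id appears in plain text somewhere in each serialized message. The names `FIND_TASK_LUA` and `task_is_queued` are hypothetical.

```python
# Lua executed server-side by Redis. It walks the list in chunks and does a
# plain substring search (4th argument `true` disables Lua pattern matching).
# Still O(n) in the queue length, as discussed above.
FIND_TASK_LUA = """
local needle = ARGV[1]
local len = redis.call('LLEN', KEYS[1])
local chunk = 1000
for start = 0, len - 1, chunk do
  local items = redis.call('LRANGE', KEYS[1], start, start + chunk - 1)
  for _, item in ipairs(items) do
    if string.find(item, needle, 1, true) then
      return 1
    end
  end
end
return 0
"""


def task_is_queued(client, queue_name, celery_task_id):
    """Return True if some element of the Redis list contains the task id.

    Assumption: `client` behaves like redis.Redis, and the id is stored
    un-encoded inside the serialized message (an undocumented detail).
    """
    found = client.eval(FIND_TASK_LUA, 1, queue_name, celery_task_id)
    return bool(found)
```

A `False` result only says the id was not found in that one list at that moment. A message that a worker has already fetched is no longer in the list, so this check alone cannot tell "lost" from "picked up".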
Maybe I don't fully grasp the impacts of it, but revoking and re-enqueuing a task which was healthy and waiting seems like a pretty good option compared to the alternative 😄

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
