repl-chris commented on PR #23432:
URL: https://github.com/apache/airflow/pull/23432#issuecomment-1120081308

   > For that reason I think I'd like you to look at if we can see if the 
taskmeta is _actually_ lost, to separate "lost" from "just in a queue behind n 
long running tasks."
   
   @ashb Yes, this would be great, I agree. I'm far from a Celery expert, but as 
far as I can tell Celery does not provide any way to do this. Any 
implementation would need to be broker-specific and reach into Celery 
internals, reading pending work items from the queue directly. This is not 
possible on SQS or RabbitMQ afaik...but I've only witnessed this problem on 
Redis, so maybe the fix only needs to work on Redis. Even on Redis the solution 
isn't pretty though...
   
   The Redis queue in Celery is implemented using a Redis "list" data type, 
which is internally a linked list. So there's no great way to check whether it 
contains a given task: we need to scan the entire list, O(n), to check if a 
single task is in there. Redis does provide an `LPOS` command which could do 
this scan for us, but it's still O(n), and it's only in newer versions of Redis 
(>=6.0.6). The list contains serialized JSON documents for the work items, so 
to use `LPOS` we'd need to synthetically generate a character-perfect copy of 
that JSON document, which uses an undocumented format. I think in theory it may 
be possible to use Lua scripting to do this scan...I'm extremely open to other 
ideas, but I think the only way we could realistically make this work decently 
is to grab our oldest known "PENDING" task_id and send a Lua script to Redis 
which scans the list doing a 
`string.find(list_element, '"task_id"="our-task-id"')` sorta thing.....and then 
hope Celery doesn't change how it works
  😛  This whole direction sounds like a pretty bad idea IMHO...but I could go 
in this direction if we actually want to go there...
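   
   To make the idea above concrete, here is a rough, untested-against-production 
sketch of what the scan could look like. Everything in it is an assumption on my 
part: the function names are made up, the queue key is whatever the deployment 
uses, and it relies on the task id appearing verbatim somewhere in each 
serialized message rather than on any particular field layout. It does a plain 
substring search (the fourth argument to `string.find` turns off Lua pattern 
matching, which matters because UUIDs contain `-`).

```python
# Hypothetical sketch: scan a Redis list for a message mentioning a task id.
# Assumes the task id appears verbatim in the serialized message; the exact
# message format is undocumented and may change.

SCAN_SCRIPT = """
local items = redis.call('LRANGE', KEYS[1], 0, -1)
for i, item in ipairs(items) do
  -- plain=true: treat ARGV[1] as a literal string, not a Lua pattern
  if string.find(item, ARGV[1], 1, true) then
    return i - 1
  end
end
return -1
"""


def find_task_position(client, queue_name, task_id):
    """Return the 0-based list index of the first message containing
    task_id, or -1 if none does.

    `client` is expected to behave like a redis-py client, i.e. expose
    eval(script, numkeys, *keys_and_args).
    """
    return int(client.eval(SCAN_SCRIPT, 1, queue_name, task_id))


def find_task_position_local(messages, task_id):
    """Pure-Python equivalent of SCAN_SCRIPT, handy for testing the logic
    without a Redis server."""
    for index, raw in enumerate(messages):
        if task_id in raw:
            return index
    return -1
```

   Note this is still O(n), reads the whole list inside the script, and blocks 
Redis while it runs, so it would only be reasonable for the occasional check on 
our oldest pending task.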
   
   Is rescheduling a "just-in-a-long-queue" task really that bad? Maybe I don't 
fully grasp the impacts of it, but revoking and re-enqueuing a task which was 
healthy and waiting seems like a pretty good option compared to the alternative 
😄
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
