pchang388 commented on issue #12701: URL: https://github.com/apache/druid/issues/12701#issuecomment-1178206688
#1 Based on my findings in my cluster (your cluster may differ but could be seeing something similar), this problem has been around in Druid for a while, or at least appears to date back to version "druid-12.2-rc3". Other users have opened similar issues that got no resolution or real engagement from the community. There are probably more, but these are the ones my key search terms turned up:

1. https://github.com/apache/druid/issues/11015
2. https://github.com/apache/druid/issues/10607
3. https://github.com/apache/druid/issues/7378

***In our specific case, the issue stems from the Supervisor/Overlord asking the running Kafka task to "pause" (which does seem to happen frequently). These "pause" requests (an IPC method that uses HTTP) usually go through fine, but often the Peon responds with "202 Accepted" instead of the usual "200 OK" or "400 Bad Request". It's not clear to me yet why the 202 was issued and why the task never actually paused. In our case, the task stayed in the "STARTING" phase when responding to "/status" HTTP calls after the 202, and seemed to remain there until it was killed because it had switched to the "PUBLISHING" phase by the time another "/pause" request came. According to the code/logs, PUBLISHING is an unpausable state and throws an exception.***

Some relevant info below:

* Found here - https://github.com/apache/druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java (see lines 1807-1816)
* According to the comments:
```
/**
 * Signals the ingestion loop to pause.
 *
 * @return one of the following Responses: 400 Bad Request if the task has started publishing; 202 Accepted if the
 * method has timed out and returned before the task has paused; 200 OK with a map of the current partition sequences
 * in the response body if the task successfully paused
 */
```
* In my task example, this log snippet is where it asked to "pause" and responded with 202.
I also noticed there is a "pause" request right before this one. That earlier pause request came ~50 minutes after task start and was successful (200 OK); the task was then resumed and showed the "STARTING" phase:
```
...
2022-07-06T21:56:35,617 DEBUG [qtp323665272-156] org.eclipse.jetty.server.HttpOutput - write(array) s=OPEN,api=BLOCKING,sc=false,e=null aggregated !flush HeapByteBuffer@5760b9c0[p=0,l=44,c=32768,r=44]={<<<Request accepted but task has not yet paused>>>2-07-06T2...\x00\x00\x00\x00\x00\x00\x00}
2022-07-06T21:56:35,617 DEBUG [qtp323665272-156] org.eclipse.jetty.server.handler.gzip.GzipHttpOutputInterceptor - org.eclipse.jetty.server.handler.gzip.GzipHttpOutputInterceptor@299973cb compressing java.util.zip.Deflater@52483be5
2022-07-06T21:56:35,617 DEBUG [qtp323665272-156] org.eclipse.jetty.server.HttpChannel - sendResponse info=null content=HeapByteBuffer@716022[p=0,l=10,c=32768,r=10]={<<<\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00>>>\x00o\xA1\xD0\r\x8b\x06\xC7\x85...\x00\x00\x00\x00\x00\x00\x00} complete=false committing=true callback=GzipBufferCB@e12c8b4[content=HeapByteBuffer@5760b9c0[p=44,l=44,c=32768,r=0]={Request a... 
paused<<<>>>2-07-06T2...\x00\x00\x00\x00\x00\x00\x00} last=false copy=null buffer=HeapByteBuffer@716022[p=0,l=10,c=32768,r=10]={<<<\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00>>>\x00o\xA1\xD0\r\x8b\x06\xC7\x85...\x00\x00\x00\x00\x00\x00\x00} deflate=java.util.zip.Deflater@52483be5 ]
2022-07-06T21:56:35,617 DEBUG [qtp323665272-156] org.eclipse.jetty.server.HttpChannel - COMMIT for /druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause on HttpChannelOverHttp@427bc76a{s=HttpChannelState@2d006490{s=HANDLING rs=BLOCKING os=COMMITTED is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=HANDLING,uri=//REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause,age=2002}
202 Accepted HTTP/1.1
Date: Wed, 06 Jul 2022 21:56:33 GMT
X-Druid-Task-Id: index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc
Content-Type: application/json
Vary: Accept-Encoding, User-Agent
Content-Encoding: gzip
...
```
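
To make the three documented `/pause` outcomes easier to reason about, here is a rough sketch of how a caller might interpret them and follow up on a 202. This is not Druid's supervisor code: the class `PauseResponseSketch`, its `Outcome` enum, and both methods are hypothetical names made up for illustration, and treating the string `"PAUSED"` as the status a paused task reports is an assumption on my part. The status source is passed in as a plain `Supplier` so the sketch runs without a cluster.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of Druid): interprets a task's /pause response. */
class PauseResponseSketch {

    enum Outcome { PAUSED, PENDING, NOT_PAUSABLE, UNEXPECTED }

    /** Maps an HTTP status code to the three cases listed in the quoted Javadoc. */
    static Outcome classify(int httpStatus) {
        switch (httpStatus) {
            case 200: return Outcome.PAUSED;       // task paused; body holds partition sequences
            case 202: return Outcome.PENDING;      // method timed out before the task paused
            case 400: return Outcome.NOT_PAUSABLE; // task has already started publishing
            default:  return Outcome.UNEXPECTED;   // anything the Javadoc does not mention
        }
    }

    /**
     * After a 202, polls a caller-supplied status source up to maxPolls times.
     * Returns true as soon as the source reports "PAUSED" (assumed status name),
     * false if the attempts run out or the thread is interrupted.
     */
    static boolean confirmPaused(Supplier<String> statusSource, int maxPolls, long sleepMillis) {
        for (int attempt = 0; attempt < maxPolls; attempt++) {
            if ("PAUSED".equals(statusSource.get())) {
                return true;
            }
            if (attempt < maxPolls - 1) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
        }
        return false;
    }
}
```

In the situation described above, `classify(202)` would give `PENDING`, and a bounded `confirmPaused` loop would return `false` if the task kept reporting "STARTING", which at least surfaces the stuck state instead of waiting until the next `/pause` hits a publishing task.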
