pchang388 commented on issue #12701:
URL: https://github.com/apache/druid/issues/12701#issuecomment-1178206688

   #1 
   
   Based on what I found in my cluster (yours may differ but could be showing 
something similar), this problem has been in Druid for a while; it appears to 
date back to at least version "druid-12.2-rc3".
   
   Other users have opened similar issues that got no resolution or real 
engagement from the community. There are probably more, but these are the ones 
my key search terms turned up:
   1. https://github.com/apache/druid/issues/11015
   2. https://github.com/apache/druid/issues/10607
   3. https://github.com/apache/druid/issues/7378
   
   ***In our specific case, the issue stems from the Supervisor/Overlord asking 
the running Kafka task to "pause", which does seem to happen frequently. These 
"pause" requests (an IPC method over HTTP) usually go through fine, but often 
the Peon responds with "202 Accepted" instead of the usual "200 OK" or 
"400 Bad Request". It's not yet clear to me why the 202 was issued or why the 
task never actually paused. In our case, the task kept reporting the "STARTING" 
phase in response to "/status" HTTP calls after the 202, and seemed to remain 
there until it was killed because it had switched to the "PUBLISHING" phase 
when another "/pause" request came in. According to the code/logs, PUBLISHING 
is an unpausable state, and a pause request in that state throws an exception.***
   
   Some relevant info below:
   * Found here (see lines 1807-1816): 
https://github.com/apache/druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java
   * According to the Javadoc comment there:
   ```
     /**
      * Signals the ingestion loop to pause.
      *
      * @return one of the following Responses: 400 Bad Request if the task has started publishing; 202 Accepted if the
      * method has timed out and returned before the task has paused; 200 OK with a map of the current partition sequences
      * in the response body if the task successfully paused
      */
   ```
   * In my example task, the log snippet below shows where it was asked to 
"pause" and responded with 202. I also noticed an earlier "pause" request right 
before this one: that first request came ~50 minutes after task start and 
succeeded (200 OK); the task was then resumed and showed the "STARTING" phase:
   ```
   ...
   2022-07-06T21:56:35,617 DEBUG [qtp323665272-156] 
org.eclipse.jetty.server.HttpOutput - write(array) 
s=OPEN,api=BLOCKING,sc=false,e=null aggregated !flush 
HeapByteBuffer@5760b9c0[p=0,l=44,c=32768,r=44]={<<<Request accepted but task 
has not yet paused>>>2-07-06T2...\x00\x00\x00\x00\x00\x00\x00}
   2022-07-06T21:56:35,617 DEBUG [qtp323665272-156] 
org.eclipse.jetty.server.handler.gzip.GzipHttpOutputInterceptor - 
org.eclipse.jetty.server.handler.gzip.GzipHttpOutputInterceptor@299973cb 
compressing java.util.zip.Deflater@52483be5
   2022-07-06T21:56:35,617 DEBUG [qtp323665272-156] 
org.eclipse.jetty.server.HttpChannel - sendResponse info=null 
content=HeapByteBuffer@716022[p=0,l=10,c=32768,r=10]={<<<\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00>>>\x00o\xA1\xD0\r\x8b\x06\xC7\x85...\x00\x00\x00\x00\x00\x00\x00}
 complete=false committing=true 
callback=GzipBufferCB@e12c8b4[content=HeapByteBuffer@5760b9c0[p=44,l=44,c=32768,r=0]={Request
 a... paused<<<>>>2-07-06T2...\x00\x00\x00\x00\x00\x00\x00} last=false 
copy=null 
buffer=HeapByteBuffer@716022[p=0,l=10,c=32768,r=10]={<<<\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00>>>\x00o\xA1\xD0\r\x8b\x06\xC7\x85...\x00\x00\x00\x00\x00\x00\x00}
 deflate=java.util.zip.Deflater@52483be5 ]
   2022-07-06T21:56:35,617 DEBUG [qtp323665272-156] 
org.eclipse.jetty.server.HttpChannel - COMMIT for 
/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause on 
HttpChannelOverHttp@427bc76a{s=HttpChannelState@2d006490{s=HANDLING rs=BLOCKING 
os=COMMITTED is=IDLE awp=false se=false i=true 
al=0},r=1,c=false/false,a=HANDLING,uri=//REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause,age=2002}
   202 Accepted HTTP/1.1
   Date: Wed, 06 Jul 2022 21:56:33 GMT
   X-Druid-Task-Id: index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc
   Content-Type: application/json
   Vary: Accept-Encoding, User-Agent
   Content-Encoding: gzip
   ...
   ```
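
To make the three documented outcomes concrete, here is a minimal Java sketch of how a caller might interpret the pause response and then confirm a 202 by re-checking the task status. It is not Druid's code: `PauseResponseSketch`, `PauseOutcome`, `classify`, and `confirmPaused` are hypothetical names, the `Supplier` stands in for a real "/status" HTTP call, and the "PAUSED" status string is an assumption. A real caller would also wait between status checks.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

public class PauseResponseSketch {

    /** Outcomes taken from the Javadoc quoted above; UNEXPECTED is a catch-all added for this sketch. */
    public enum PauseOutcome { PAUSED, ACCEPTED_NOT_YET_PAUSED, NOT_PAUSABLE, UNEXPECTED }

    /** Maps the HTTP status code of a pause response to one of the documented outcomes. */
    public static PauseOutcome classify(int httpStatus) {
        switch (httpStatus) {
            case 200: return PauseOutcome.PAUSED;                  // task paused
            case 202: return PauseOutcome.ACCEPTED_NOT_YET_PAUSED; // timed out before pausing
            case 400: return PauseOutcome.NOT_PAUSABLE;            // task has started publishing
            default:  return PauseOutcome.UNEXPECTED;
        }
    }

    /**
     * After a 202, checks the reported status up to maxAttempts times.
     * statusSource is a stand-in for a "/status" call; "PAUSED" is an assumed status string.
     */
    public static boolean confirmPaused(Supplier<String> statusSource, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if ("PAUSED".equals(statusSource.get())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(classify(202));

        // Simulates the behaviour described in this comment: the task keeps reporting STARTING.
        Supplier<String> stuckTask = () -> "STARTING";
        System.out.println(confirmPaused(stuckTask, 5));

        // Simulates a task that reports PAUSED on the third status check.
        Iterator<String> statuses = Arrays.asList("STARTING", "STARTING", "PAUSED").iterator();
        System.out.println(confirmPaused(statuses::next, 5));
    }
}
```

With the stuck-task stand-in, `confirmPaused` gives up after its attempts run out, which mirrors the case above where a 202 was never followed by an actual pause.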


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

