maytasm commented on PR #17847:
URL: https://github.com/apache/druid/pull/17847#issuecomment-2779705752

   @kfaraz 
   We want to be able to measure and ensure that our e2e streaming ingestion 
latency stays within some bound x (for an SLA). For example, we may have a use 
case where data must be available within x time after it is produced to Kafka. 
Currently, Druid calculates ingest/events/messageGap as the time gap in 
milliseconds between the latest ingested event timestamp (during that emission 
period) and the current system timestamp at metrics emission. This effectively 
reports the **minimum** gap in the period, which is not very useful: we cannot 
measure, track, or ensure our SLA with how ingest/events/messageGap is 
currently calculated.
   
   Here is an example:
   Emission period of 5 secs, from t0 to t4:
   
   At t0, we have a message arriving with timestamp of t-500
   At t1, we have a message arriving with timestamp of t-499
   At t2, we have a message arriving with timestamp of t-499
   At t3, we have a message arriving with timestamp of t-499
   At t4, we have a message arriving with timestamp of t3
   
   The above is an example where we process 5 rows of data. t3 is the latest 
ingested event timestamp in this period.
   When we emit the metric at t4, we would calculate the message gap as 
t4 - t3 = 1 sec.
   This disregards all the earlier late messages.  
   For example, in the above, if our SLA to our users is 5 secs, the 
ingest/events/messageGap reported as a 1 sec gap would seem like we are within 
SLA, but in fact 80% of our messages are at least 500 seconds late!
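   
   To make the arithmetic concrete, here is a minimal sketch of the current 
calculation as described above. It is a simplified model, not Druid's actual 
code: it assumes t0..t4 are one second apart, uses seconds instead of 
milliseconds, and the function name `current_message_gap` is hypothetical.

```python
# Simplified model of the behaviour described above; NOT Druid's implementation.
# Each tuple is (arrival_time, event_timestamp) in seconds, with t0 = 0,
# so "t-500" becomes -500 and "t3" becomes 3.
messages = [(0, -500), (1, -499), (2, -499), (3, -499), (4, 3)]


def current_message_gap(messages, emission_time):
    """Gap between emission time and the latest event timestamp in the period."""
    latest_event_ts = max(event_ts for _, event_ts in messages)
    return emission_time - latest_event_ts


gap = current_message_gap(messages, emission_time=4)
print(gap)  # 1
```

   Only the single freshest message determines the reported value, so the four 
late messages have no effect on it.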
   
   We want to improve it by:
   
   - Calculating the messageGap for each message individually
       - i.e. At t0, we have a message arriving with timestamp of t-500. This 
should be recorded as a 500 sec messageGap
   
   - Reporting either a distribution (not sure if this is possible with Druid's 
metric system) or a min/max/avg of the messageGaps we saw in an emission period 
(min/max/avg would still be useful)
       - In the example above, we saw messageGaps of 500 seconds, 500 seconds, 
501 seconds, 502 seconds, and 1 second. We should report a min of 1s, a max of 
502s, and an avg of 300.8s
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

