mmiklavc commented on a change in pull request #1197: METRON-1778 Out-of-order 
timestamps may delay flush in Storm Profiler
URL: https://github.com/apache/metron/pull/1197#discussion_r260464684
 
 

 ##########
 File path: metron-analytics/metron-profiler-storm/src/main/java/org/apache/metron/profiler/storm/FixedFrequencyFlushSignal.java
 ##########
 @@ -74,31 +72,34 @@ public void reset() {
    */
   @Override
   public void update(long timestamp) {
+    if(LOG.isWarnEnabled()) {
+      checkIfOutOfOrder(timestamp);
+    }
 
-    if(timestamp > currentTime) {
-
-      // need to update current time
-      LOG.debug("Updating current time; last={}, new={}", currentTime, timestamp);
-      currentTime = timestamp;
-
-    } else if ((currentTime - timestamp) > flushFrequency) {
-
-      // significantly out-of-order timestamps
-      LOG.warn("Timestamps out-of-order by '{}' ms. This may indicate a problem in the data. last={}, current={}",
-              (currentTime - timestamp),
-              timestamp,
-              currentTime);
+    if(timestamp < minTime) {
+      minTime = timestamp;
     }
 
-    if(flushTime == 0) {
+    if(timestamp > maxTime) {
+      maxTime = timestamp;
+    }
+  }
 
-      // set the next time to flush
-      flushTime = currentTime + flushFrequency;
-      LOG.debug("Setting flush time; '{}' ms until flush; flushTime={}, currentTime={}, flushFreq={}",
-              timeToNextFlush(),
-              flushTime,
-              currentTime,
-              flushFrequency);
+  /**
+   * Checks if the timestamp is significantly out-of-order.
+   *
+   * @param timestamp The last timestamp.
+   */
+  private void checkIfOutOfOrder(long timestamp) {
+    // do not warn if this is the first timestamp we've seen, which will always be 'out-of-order'
+    if (maxTime > Long.MIN_VALUE) {
+
+      long outOfOrderBy = maxTime - timestamp;
+      if (Math.abs(outOfOrderBy) > flushFrequency) {
 
 Review comment:
   > Unfortunately a watermark doesn't stop time from being incorrectly 
advanced too far into the future, like my example. It's only a mechanism to 
deal with late data.
   
   Absolutely. My expectation is that we would have to introduce a more robust mechanism to handle this. In all likelihood we won't end up using Structured Streaming, but there are lessons to draw from its handling of late data.
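
   To make the quoted concern concrete, here is a minimal sketch of a watermark, assuming epoch-millisecond timestamps. The class `WatermarkSketch` and its `allowedLatenessMillis` parameter are hypothetical names for illustration; this is not Metron or Structured Streaming code.

```java
/** Hypothetical illustration only; not part of Metron or Spark. */
final class WatermarkSketch {
  private final long allowedLatenessMillis;
  private long maxEventTime = Long.MIN_VALUE; // no timestamp seen yet

  WatermarkSketch(long allowedLatenessMillis) {
    this.allowedLatenessMillis = allowedLatenessMillis;
  }

  /** The watermark trails the largest event time seen by the allowed lateness. */
  long watermark() {
    return maxEventTime == Long.MIN_VALUE
        ? Long.MIN_VALUE
        : maxEventTime - allowedLatenessMillis;
  }

  /** Returns false when the timestamp is older than the watermark (too late). */
  boolean accept(long timestamp) {
    if (timestamp < watermark()) {
      return false;
    }
    // Nothing bounds how far forward a single timestamp can move the watermark.
    maxEventTime = Math.max(maxEventTime, timestamp);
    return true;
  }
}
```

   In this sketch, one timestamp 2 days in the future is accepted and moves the watermark forward by 2 days, after which correctly timestamped messages are rejected as late. The watermark only bounds lateness; it does nothing about time advancing too far.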
   
   For incorrectly timestamped data that is too far into the future, I'm not aware of an explicit built-in mechanism. That said, I'm not sure the onus should fall completely on the profiler. I think we should handle this at least in part in the parsers or enrichment as a sort of sanity check (never dropping data permanently, of course, but writing it to an error topic instead). Maybe that logic can be shared with the profiler, actually. I think we'll want some kind of max temporal delta as the trigger for whether or not we send a message to the error queue. In your example of a timestamp 2 days into the future, that clearly looks like a system malfunction or misconfiguration. Perhaps it's an incorrect timezone translation or some other anomaly, but I would expect it to be an exceptional case, and I think we should ultimately exclude it from the calculation and notify admins via an exception or error.
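
   A rough sketch of what a max temporal delta check could look like, assuming epoch-millisecond timestamps and some trusted reference time such as the wall clock at ingest. `TemporalDeltaCheck`, `Route`, and `maxDeltaMillis` are hypothetical names, not existing Metron APIs, and the sketch only flags timestamps in the future, leaving late data to whatever lateness mechanism we settle on.

```java
/** Hypothetical illustration only; not an existing Metron class. */
final class TemporalDeltaCheck {
  /** Where a message should go after the sanity check. */
  enum Route { NORMAL, ERROR }

  private final long maxDeltaMillis;

  TemporalDeltaCheck(long maxDeltaMillis) {
    if (maxDeltaMillis < 0) {
      throw new IllegalArgumentException("maxDeltaMillis must be >= 0");
    }
    this.maxDeltaMillis = maxDeltaMillis;
  }

  /**
   * Routes a message to ERROR when its event time is further ahead of the
   * reference time than the configured maximum. The message is rerouted,
   * never dropped. Assumes epoch-millisecond values, so the subtraction
   * cannot overflow in practice.
   */
  Route route(long eventTimeMillis, long referenceTimeMillis) {
    long aheadBy = eventTimeMillis - referenceTimeMillis;
    return aheadBy > maxDeltaMillis ? Route.ERROR : Route.NORMAL;
  }
}
```

   With a one-hour limit, for example, a timestamp 2 days ahead of the reference time would be routed to the error topic, while a timestamp a few seconds ahead (ordinary clock skew) would pass through.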
   
   Batch jobs are another beast, but at least we have more control over total 
ordering and sanitization there than we do with streaming data, and I'd think 
we're less likely to see "future" timestamps in a batch. Some issues with a 
batch could be:
   
   1. Entire batch of timestamps is wrong or missing
   2. Some timestamps are extremely _EARLY_ in relation to the rest of the batch (i.e. they probably don't belong to the batch)
   3. Some timestamps are extremely _LATE_ in relation to the rest of the batch (i.e. they probably don't belong to the batch)
   4. Completely out of order - needs sorting by timestamp
   5. Homogeneous timestamp on all batch records - temporal dilation, effectively caused by all records being tagged with the same batch time rather than their individual event times
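
   As a sketch of how a batch job might detect some of the issues listed above, assuming the batch's timestamps are available as an array of epoch milliseconds: `BatchTimestampCheck`, `Issue`, and `maxSpreadMillis` are hypothetical names. A batch whose timestamps are all wrong (item 1) generally needs an external reference to detect, so only the empty case is covered here.

```java
import java.util.Arrays;
import java.util.EnumSet;

/** Hypothetical illustration only; not an existing Metron class. */
final class BatchTimestampCheck {
  enum Issue { MISSING, EARLY_OUTLIERS, LATE_OUTLIERS, OUT_OF_ORDER, HOMOGENEOUS }

  /**
   * Flags issues in a batch of timestamps. A timestamp further than
   * maxSpreadMillis from the batch median is treated as an outlier.
   */
  static EnumSet<Issue> check(long[] timestamps, long maxSpreadMillis) {
    EnumSet<Issue> issues = EnumSet.noneOf(Issue.class);
    if (timestamps.length == 0) {
      issues.add(Issue.MISSING);
      return issues;
    }

    // Work on a sorted copy so the caller's array is left untouched.
    long[] sorted = timestamps.clone();
    Arrays.sort(sorted);
    long earliest = sorted[0];
    long latest = sorted[sorted.length - 1];
    long median = sorted[sorted.length / 2];

    if (sorted.length > 1 && earliest == latest) {
      issues.add(Issue.HOMOGENEOUS); // every record carries the same time
    }
    if (median - earliest > maxSpreadMillis) {
      issues.add(Issue.EARLY_OUTLIERS);
    }
    if (latest - median > maxSpreadMillis) {
      issues.add(Issue.LATE_OUTLIERS);
    }
    if (!Arrays.equals(timestamps, sorted)) {
      issues.add(Issue.OUT_OF_ORDER); // batch needs sorting by timestamp
    }
    return issues;
  }
}
```

   The median is used as the reference point so that a handful of stray records cannot shift it much; a real implementation would need to decide what to do with flagged records rather than only reporting them.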
