mmiklavc commented on a change in pull request #1197: METRON-1778 Out-of-order
timestamps may delay flush in Storm Profiler
URL: https://github.com/apache/metron/pull/1197#discussion_r260464684
##########
File path:
metron-analytics/metron-profiler-storm/src/main/java/org/apache/metron/profiler/storm/FixedFrequencyFlushSignal.java
##########
@@ -74,31 +72,34 @@ public void reset() {
*/
@Override
public void update(long timestamp) {
+ if(LOG.isWarnEnabled()) {
+ checkIfOutOfOrder(timestamp);
+ }
- if(timestamp > currentTime) {
-
- // need to update current time
- LOG.debug("Updating current time; last={}, new={}", currentTime, timestamp);
- currentTime = timestamp;
-
- } else if ((currentTime - timestamp) > flushFrequency) {
-
- // significantly out-of-order timestamps
- LOG.warn("Timestamps out-of-order by '{}' ms. This may indicate a problem in the data. last={}, current={}",
- (currentTime - timestamp),
- timestamp,
- currentTime);
+ if(timestamp < minTime) {
+ minTime = timestamp;
}
- if(flushTime == 0) {
+ if(timestamp > maxTime) {
+ maxTime = timestamp;
+ }
+ }
- // set the next time to flush
- flushTime = currentTime + flushFrequency;
- LOG.debug("Setting flush time; '{}' ms until flush; flushTime={}, currentTime={}, flushFreq={}",
- timeToNextFlush(),
- flushTime,
- currentTime,
- flushFrequency);
+ /**
+ * Checks if the timestamp is significantly out-of-order.
+ *
+ * @param timestamp The last timestamp.
+ */
+ private void checkIfOutOfOrder(long timestamp) {
+ // do not warn if this is the first timestamp we've seen, which will always be 'out-of-order'
+ if (maxTime > Long.MIN_VALUE) {
+
+ long outOfOrderBy = maxTime - timestamp;
+ if (Math.abs(outOfOrderBy) > flushFrequency) {
Review comment:
> Unfortunately a watermark doesn't stop time from being incorrectly
advanced too far into the future, like my example. It's only a mechanism to
deal with late data.
Absolutely. My expectation is that we would have to introduce a more robust
mechanism to handle this. In all likelihood we won't end up using structured
streaming, but there are lessons to pull from its handling of late data.
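
To make the quoted concern concrete, here is a minimal sketch of a watermark, assuming the usual definition (the maximum event time seen so far minus an allowed lateness). `WatermarkTracker` and its methods are hypothetical names for illustration only; this is neither the structured streaming API nor anything in the profiler.

```java
/** Hypothetical illustration of a watermark; not part of Metron or any streaming API. */
final class WatermarkTracker {

  private final long allowedLatenessMillis;
  private long maxEventTime = Long.MIN_VALUE;

  WatermarkTracker(long allowedLatenessMillis) {
    this.allowedLatenessMillis = allowedLatenessMillis;
  }

  /** The watermark trails the largest event time seen so far by the allowed lateness. */
  long watermark() {
    return maxEventTime == Long.MIN_VALUE
        ? Long.MIN_VALUE
        : maxEventTime - allowedLatenessMillis;
  }

  /**
   * Returns true if the event is accepted, false if it is older than the watermark.
   * Accepted events may advance the maximum event time, and with it the watermark.
   */
  boolean accept(long eventTimeMillis) {
    if (eventTimeMillis < watermark()) {
      return false; // too late
    }
    maxEventTime = Math.max(maxEventTime, eventTimeMillis);
    return true;
  }
}
```

With an allowed lateness of 10 minutes, a single event stamped 2 days ahead is accepted and drags the watermark forward with it, so correctly stamped events arriving afterwards are rejected as late. Nothing in the mechanism guards against time moving too far forward.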
For incorrectly timestamped data that's too far into the future, I'm not
aware of an explicit built-in mechanism. That said, I'm not sure the onus
should fall entirely on the profiler. I think we should attempt to handle
this, at least in part, in parsers or enrichment as a sort of sanity check
(never dropping data permanently, of course, but writing it to an error
topic instead). Maybe that logic can be shared with the profiler, actually.
I think we'll want some kind of max temporal delta as the trigger for whether
or not we send a message to the error queue. In your example of a timestamp 2
days into the future, that clearly seems like a system malfunction or
misconfiguration. Perhaps it's an incorrect timezone translation, or some other
anomaly, but I would expect that to be an exceptional case, and I think we
should ultimately exclude it from calculation and notify admins of it as an
exception or error.
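
As a rough sketch of the max temporal delta idea, something like the following could be shared between parsers, enrichment, and the profiler. Everything here is an assumption for illustration: `TimestampSanityCheck`, `Verdict`, and both thresholds are hypothetical names, the reference time is supplied by the caller (wall clock or otherwise), and routing a rejected message to the error topic is left to the caller.

```java
import java.time.Duration;

/** Hypothetical sanity check; none of these names exist in Metron. */
final class TimestampSanityCheck {

  enum Verdict { ACCEPT, TOO_FAR_IN_FUTURE, TOO_FAR_IN_PAST }

  private final long maxFutureMillis;
  private final long maxPastMillis;

  TimestampSanityCheck(Duration maxFuture, Duration maxPast) {
    this.maxFutureMillis = maxFuture.toMillis();
    this.maxPastMillis = maxPast.toMillis();
  }

  /**
   * Compares an event timestamp against a reference time, both in epoch millis.
   * Assumes realistic epoch values, so the subtraction cannot overflow.
   */
  Verdict check(long eventTimeMillis, long referenceTimeMillis) {
    long delta = eventTimeMillis - referenceTimeMillis;
    if (delta > maxFutureMillis) {
      return Verdict.TOO_FAR_IN_FUTURE;
    }
    if (-delta > maxPastMillis) {
      return Verdict.TOO_FAR_IN_PAST;
    }
    return Verdict.ACCEPT;
  }
}
```

With a future threshold of, say, 1 hour, the 2-days-ahead timestamp from your example would come back as `TOO_FAR_IN_FUTURE`. The caller could then write the message to the error topic and keep it out of the profile calculation, rather than letting it advance time.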
Batch jobs are another beast, but at least we have more control over total
ordering and sanitization there than we do with streaming data, and I'd think
we're less likely to see "future" timestamps in a batch. Some issues with a
batch could be:
1. Entire batch of timestamps is wrong or missing
2. Some timestamps extremely _EARLY_ in relation to the rest of the batch
(i.e., they probably don't belong to the batch)
3. Some timestamps extremely _LATE_ in relation to the rest of the batch
(i.e., they probably don't belong to the batch)
4. Completely out of order - needs sorting by timestamp
5. Homogeneous timestamp on all batch records - effectively temporal dilation,
caused by all records being tagged with the same batch time rather than their
individual event times
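
To illustrate, most of the batch issues listed above could be detected with a single pass plus a sort. This is only a sketch under stated assumptions: `BatchTimestampAudit` and `Issue` are hypothetical names, timestamps are epoch millis in arrival order, a value `<= 0` stands in for "missing", and "extremely early/late" is taken to mean farther than `maxSpreadMillis` from the batch median. It cannot tell that an entire batch is wrong yet self-consistent; that would need an external reference time.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical batch audit; none of these names exist in Metron. */
final class BatchTimestampAudit {

  enum Issue { ALL_MISSING, EARLY_OUTLIERS, LATE_OUTLIERS, OUT_OF_ORDER, HOMOGENEOUS }

  /**
   * @param timestamps      epoch millis in arrival order; values <= 0 are treated as missing
   * @param maxSpreadMillis largest tolerated distance from the batch median
   */
  static Set<Issue> audit(long[] timestamps, long maxSpreadMillis) {
    Set<Issue> issues = EnumSet.noneOf(Issue.class);

    // Issue 1 (partially): every timestamp in the batch is missing.
    long[] valid = Arrays.stream(timestamps).filter(t -> t > 0).toArray();
    if (valid.length == 0) {
      issues.add(Issue.ALL_MISSING);
      return issues;
    }

    // Issue 4: any timestamp smaller than its predecessor in arrival order.
    for (int i = 1; i < valid.length; i++) {
      if (valid[i] < valid[i - 1]) {
        issues.add(Issue.OUT_OF_ORDER);
        break;
      }
    }

    long[] sorted = valid.clone();
    Arrays.sort(sorted);
    long min = sorted[0];
    long max = sorted[sorted.length - 1];
    long median = sorted[sorted.length / 2];

    // Issues 2 and 3: extremes that sit too far from the median.
    if (median - min > maxSpreadMillis) {
      issues.add(Issue.EARLY_OUTLIERS);
    }
    if (max - median > maxSpreadMillis) {
      issues.add(Issue.LATE_OUTLIERS);
    }

    // Issue 5: more than one record, all carrying the same timestamp.
    if (sorted.length > 1 && min == max) {
      issues.add(Issue.HOMOGENEOUS);
    }
    return issues;
  }
}
```

For example, `BatchTimestampAudit.audit(new long[] {10_000, 10_100, 1, 10_200, 99_000}, 1_000)` would flag `EARLY_OUTLIERS`, `LATE_OUTLIERS`, and `OUT_OF_ORDER`. The median is used as the reference point so that a few stray records do not shift the baseline the way a mean would.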