mmiklavc commented on a change in pull request #1197: METRON-1778 Out-of-order
timestamps may delay flush in Storm Profiler
URL: https://github.com/apache/metron/pull/1197#discussion_r260440810
##########
File path:
metron-analytics/metron-profiler-storm/src/main/java/org/apache/metron/profiler/storm/FixedFrequencyFlushSignal.java
##########
@@ -74,31 +72,34 @@ public void reset() {
*/
@Override
public void update(long timestamp) {
+ if(LOG.isWarnEnabled()) {
+ checkIfOutOfOrder(timestamp);
+ }
- if(timestamp > currentTime) {
-
- // need to update current time
- LOG.debug("Updating current time; last={}, new={}", currentTime, timestamp);
- currentTime = timestamp;
-
- } else if ((currentTime - timestamp) > flushFrequency) {
-
- // significantly out-of-order timestamps
- LOG.warn("Timestamps out-of-order by '{}' ms. This may indicate a problem in the data. last={}, current={}",
- (currentTime - timestamp),
- timestamp,
- currentTime);
+ if(timestamp < minTime) {
+ minTime = timestamp;
}
- if(flushTime == 0) {
+ if(timestamp > maxTime) {
+ maxTime = timestamp;
+ }
+ }
- // set the next time to flush
- flushTime = currentTime + flushFrequency;
- LOG.debug("Setting flush time; '{}' ms until flush; flushTime={}, currentTime={}, flushFreq={}",
- timeToNextFlush(),
- flushTime,
- currentTime,
- flushFrequency);
+ /**
+ * Checks if the timestamp is significantly out-of-order.
+ *
+ * @param timestamp The last timestamp.
+ */
+ private void checkIfOutOfOrder(long timestamp) {
+ // do not warn if this is the first timestamp we've seen, which will always be 'out-of-order'
+ if (maxTime > Long.MIN_VALUE) {
+
+ long outOfOrderBy = maxTime - timestamp;
+ if (Math.abs(outOfOrderBy) > flushFrequency) {
Review comment:
@nickwallen I enjoyed your prose on this one haha. I think your inclination
to log something is reasonable, though it does add some risk of flooding the
logs or filling the disk in the event of a large volume of stale or incorrectly
stamped data being suddenly loaded. I still think it's reasonable,
notwithstanding this potential risk. We might want to consider watermarking as
a more robust enhancement for solving this problem down the road. My +1 stands,
thanks for the detailed explanation.
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
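
To make the watermarking idea a bit more concrete, here is a minimal sketch of what it might look like. It is not Metron code: the class name `WatermarkSketch`, the `allowedLatenessMillis` and `warnIntervalMillis` parameters, and the choice to rate-limit the warning are all assumptions made for illustration. The watermark trails the largest timestamp seen by a fixed lateness; anything older than that is counted as late, and the warning is emitted at most once per interval so a burst of stale data cannot flood the logs.

```java
/**
 * Hypothetical sketch of an event-time watermark with a rate-limited warning.
 * Not part of Metron; all names and parameters here are illustrative.
 */
final class WatermarkSketch {

  private final long allowedLatenessMillis; // how far behind maxEventTime is tolerated
  private final long warnIntervalMillis;    // minimum wall-clock gap between warnings

  private long maxEventTime = Long.MIN_VALUE; // largest event timestamp seen so far
  private long lateCount = 0;                 // total timestamps behind the watermark
  private long warningsEmitted = 0;           // warnings actually written
  private long lastWarnAt = 0;                // wall-clock time of the last warning

  WatermarkSketch(long allowedLatenessMillis, long warnIntervalMillis) {
    this.allowedLatenessMillis = allowedLatenessMillis;
    this.warnIntervalMillis = warnIntervalMillis;
  }

  /** The watermark, or Long.MIN_VALUE if no timestamp has been seen yet. */
  long watermark() {
    return maxEventTime == Long.MIN_VALUE
        ? Long.MIN_VALUE
        : maxEventTime - allowedLatenessMillis;
  }

  /**
   * Offers an event timestamp. Returns false if it falls behind the watermark.
   * The wall-clock time is passed in so the sketch stays deterministic.
   */
  boolean update(long timestamp, long wallClockMillis) {
    if (timestamp < watermark()) {
      lateCount++;
      maybeWarn(timestamp, wallClockMillis);
      return false;
    }
    if (timestamp > maxEventTime) {
      maxEventTime = timestamp;
    }
    return true;
  }

  /** Warns on the first late timestamp, then at most once per interval. */
  private void maybeWarn(long timestamp, long wallClockMillis) {
    boolean due = warningsEmitted == 0
        || (wallClockMillis - lastWarnAt) >= warnIntervalMillis;
    if (due) {
      System.err.println("Late timestamp " + timestamp + " is behind watermark "
          + watermark() + "; late so far: " + lateCount);
      lastWarnAt = wallClockMillis;
      warningsEmitted++;
    }
  }

  long lateCount() { return lateCount; }

  long warningsEmitted() { return warningsEmitted; }
}
```

Whether late timestamps should be dropped, routed elsewhere, or still folded into the profile is a separate design question that this sketch leaves open; it only shows the bookkeeping.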