Hi, Thanks for reviving this thread. I think having the watermark is very good. Some runners, for example Dataflow and Flink have their own internal metric for the watermark but having it cross-runner seems beneficial (if maybe a bit wasteful).
Best, Aljoscha > On 2. Jun 2017, at 03:52, JingsongLee <lzljs3620...@aliyun.com> wrote: > > @Aviem Zur @Ben Chambers What do you think about the value of > METRIC_MAX_SPLITS? > > ------------------------------------------------------------------From:JingsongLee > <lzljs3620...@aliyun.com>Time:2017 May 11 (Thu) 16:37To:dev@beam.apache.org > <dev@beam.apache.org>Subject:[DISCUSS] Source Watermark Metrics > Hi everyone, > > The source watermark metrics show the consumer latency of Source. > It allows the user to know the health of the job, or it can be used to > monitor and alarm. > We should have the runner report the watermark metricsrather than > having the source report it using metrics. This addresses the fact that even > if the source has advanced to 8:00, the runner may still know about buffered > elements at 7:00, and so not advance the watermark all the way to 8:00. > The metrics Includes: > 1.Source watermark (`min` amongst all splits): > type = Gauge, namespace = io, name = source_watermark > 2.Source watermark per split: > type = Gauge, namespace = io.splits, name = <split_id>.source_watermark > > Min Source watermark amongst all splits seems difficult to implement since > some runners(like FlinkRunner) can't access to all the splits to aggregate > and there is no such AggregatorMetric. > > So We could report watermark per split and users could use a `min` > aggregation on this in their metrics backends. However, as was mentioned > in the IO metrics proposal by several people this could be problematic in > sources with many splits. > > So we do a check when report metrics to solve the problem of too many splits. > {code} > if (splitsNum <= METRIC_MAX_SPLITS) { > // set the sourceWatermarkOfSplit > } > {code} > > So I'd like to take a discussion to the implement of source watermark metrics > and specific how many splits is too many. (the value of METRIC_MAX_SPLITS) > > JIRA: > IO metrics (https://issues.apache.org/jira/browse/BEAM-1919) > Source watermark (https://issues.apache.org/jira/browse/BEAM-1941) >