[
https://issues.apache.org/jira/browse/FLINK-36071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rui Fan updated FLINK-36071:
----------------------------
Description:
A number of Flink metrics use System.currentTimeMillis[1] to measure elapsed
time. I propose to refactor them from System.currentTimeMillis to
System.nanoTime[2].
h1. Why do we need to refactor it?
Note: High precision *{color:#de350b}is not{color}* the reason for the refactoring.
Actually, System.currentTimeMillis() and System.nanoTime() have completely
different semantics.
System.currentTimeMillis() *{color:#de350b}!={color}* System.nanoTime() /
1_000_000
* System.currentTimeMillis() returns the current system (wall-clock) time of
the server.
** The time can be updated by NTP[3], or it can be adjusted manually.
** Therefore, when we use System.currentTimeMillis, the end time may be less
than the start time.
* System.nanoTime() usually indicates the length of time since the operating
system was booted.
** So System.nanoTime isn't system time, and it's not affected by changes to
the system time.
** System.nanoTime (inside the process) is monotonically increasing and never
goes back.
** As the Javadoc[2] states: this method can only be used to measure
elapsed time and is not related to any other notion of system or wall-clock
time.
This blog post[4] explains the difference in detail.
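To make the difference concrete, here is a minimal sketch (not Flink code). The class and method names are hypothetical, and the wall-clock readings in main are simulated values chosen to mimic a clock that is set back by 5 seconds during a measurement.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Flink.
final class ElapsedTimeDemo {

    /** Elapsed milliseconds between two System.nanoTime() readings. */
    static long elapsedMillis(long startNanos, long endNanos) {
        // Subtract first: nanoTime values are only meaningful as differences,
        // and the difference stays correct even if the counter overflows.
        return TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
    }

    /** Naive wall-clock duration; negative if the clock was set back in between. */
    static long wallClockDurationMillis(long startMillis, long endMillis) {
        return endMillis - startMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(20);
        long end = System.nanoTime();
        System.out.println("elapsed ms (nanoTime): " + elapsedMillis(start, end));

        // Simulated readings: the wall clock is adjusted back by 5 s mid-measurement.
        long wallStart = 1_700_000_010_000L;
        long wallEnd = wallStart + 20 - 5_000;
        System.out.println(
                "elapsed ms (wall clock): " + wallClockDurationMillis(wallStart, wallEnd));
    }
}
```

With the simulated readings, the wall-clock "duration" comes out negative, while the nanoTime-based difference is unaffected by any change to the system time.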
h1. Current use cases:
Based on the previous section, System.nanoTime is the recommended way to
measure durations.
Most tracing systems use it, and Flink already uses it to measure the
duration for some metrics, such as:
* all latency tracking of the state backend
* SubtaskCheckpointCoordinatorImpl#takeSnapshotSync measures the checkpoint
Sync Duration
* etc
In addition, Flink's Clock[5] already provides absoluteTimeMillis,
relativeTimeMillis and relativeTimeNanos. But I guess most developers
don't know these details.
* absoluteTimeMillis is using System.currentTimeMillis
* relativeTimeMillis and relativeTimeNanos are using System.nanoTime
* It's better to call relativeTimeNanos or relativeTimeMillis instead of
absoluteTimeMillis for all duration related metrics
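The distinction between the three Clock methods can be sketched as follows. This is a self-contained stand-in, not Flink's actual Clock: only the three method names come from the description above, while SketchClock, SystemSketchClock, ManualSketchClock and their helper methods are hypothetical names introduced for illustration.

```java
// Hypothetical stand-in exposing the three methods named above.
abstract class SketchClock {
    abstract long absoluteTimeMillis();
    abstract long relativeTimeMillis();
    abstract long relativeTimeNanos();
}

// Backed by the JVM clocks, following the mapping described above.
final class SystemSketchClock extends SketchClock {
    @Override
    long absoluteTimeMillis() {
        return System.currentTimeMillis();
    }

    @Override
    long relativeTimeMillis() {
        return System.nanoTime() / 1_000_000L;
    }

    @Override
    long relativeTimeNanos() {
        return System.nanoTime();
    }
}

// Manually driven clock (assumption: for tests only) that can simulate
// a wall-clock adjustment without touching the relative time.
final class ManualSketchClock extends SketchClock {
    private long wallMillis = 1_700_000_000_000L;
    private long nanos = 0L;

    /** Real time passes: both clocks move forward. */
    void advanceMillis(long millis) {
        wallMillis += millis;
        nanos += millis * 1_000_000L;
    }

    /** NTP or manual adjustment: only the wall clock changes. */
    void adjustWallClockMillis(long deltaMillis) {
        wallMillis += deltaMillis;
    }

    @Override
    long absoluteTimeMillis() {
        return wallMillis;
    }

    @Override
    long relativeTimeMillis() {
        return nanos / 1_000_000L;
    }

    @Override
    long relativeTimeNanos() {
        return nanos;
    }
}
```

For example, after advanceMillis(30) followed by adjustWallClockMillis(-1000), a duration computed from relativeTimeNanos is still 30 ms, while the difference of two absoluteTimeMillis readings is -970 ms.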
h1. Proposed changes:
This Jira proposes that Flink use System.nanoTime uniformly for duration
calculation.
Currently, many components still use System.currentTimeMillis to calculate
durations, including:
* TimerGauge
* TaskIOMetricGroup
* ThroughputCalculator
* DeploymentStateTimeMetrics
* A lot of methods in StreamTask
* etc
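As a rough illustration of what such a refactoring could look like, here is a sketch of a timer that accumulates durations from a nanosecond time source while still reporting milliseconds. It is not the implementation of any of the components listed above: SketchTimer, its methods, and the injectable LongSupplier time source are all hypothetical.

```java
import java.util.function.LongSupplier;

// Hypothetical timer, not Flink's TimerGauge.
final class SketchTimer {
    private final LongSupplier nanoClock;
    private long startNanos;
    private boolean running;
    private long accumulatedNanos;

    /** Uses System.nanoTime, which is unaffected by wall-clock adjustments. */
    SketchTimer() {
        this(System::nanoTime);
    }

    /** Allows injecting a deterministic time source, e.g. in tests. */
    SketchTimer(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    void markStart() {
        if (!running) {
            startNanos = nanoClock.getAsLong();
            running = true;
        }
    }

    void markEnd() {
        if (running) {
            // Difference of two readings from the same nanosecond source.
            accumulatedNanos += nanoClock.getAsLong() - startNanos;
            running = false;
        }
    }

    /** Reported unit stays milliseconds; only the time source changes. */
    long getAccumulatedMillis() {
        return accumulatedNanos / 1_000_000L;
    }
}
```

Keeping the reported unit in milliseconds means that, in this sketch, only the internal time source changes and consumers of the metric see the same unit as before.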
[1] [https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#currentTimeMillis--]
[2] [https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime--]
[3] [https://en.wikipedia.org/wiki/Network_Time_Protocol]
[4] [https://www.javaadvent.com/2019/12/measuring-time-from-java-to-kernel-and-back.html]
[5] [https://github.com/apache/flink/blob/729b8b81a77ba6c32711216b88a1bf57ccddfadc/flink-core/src/main/java/org/apache/flink/util/clock/Clock.java#L40]
> Using System.nanoTime to measure the elapsed time instead of
> System.currentTimeMillis
> -------------------------------------------------------------------------------------
>
> Key: FLINK-36071
> URL: https://issues.apache.org/jira/browse/FLINK-36071
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Metrics
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)