Hong Liang Teoh created FLINK-32535:
---------------------------------------

Summary: CheckpointingStatisticsHandler periodically returns NullArgumentException after job restarts
Key: FLINK-32535
URL: https://issues.apache.org/jira/browse/FLINK-32535
Project: Flink
Issue Type: Bug
Components: Runtime / REST
Affects Versions: 1.17.1, 1.16.2
Reporter: Hong Liang Teoh
Fix For: 1.18.0

*What*

When making requests to the /checkpoints REST API after a job restart, we see HTTP 500 responses for a short period of time. We should handle this gracefully in the CheckpointingStatisticsHandler.

*How to replicate*
* Checkpointing interval 1s
* Job is constantly restarting
* Make constant requests to the /checkpoints REST API.

Stack trace:

{{org.apache.commons.math3.exception.NullArgumentException: input array}}
{{ at org.apache.commons.math3.util.MathArrays.verifyValues(MathArrays.java:1753)}}
{{ at org.apache.commons.math3.stat.descriptive.AbstractUnivariateStatistic.test(AbstractUnivariateStatistic.java:158)}}
{{ at org.apache.commons.math3.stat.descriptive.rank.Percentile.evaluate(Percentile.java:272)}}
{{ at org.apache.commons.math3.stat.descriptive.rank.Percentile.evaluate(Percentile.java:241)}}
{{ at org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogramStatistics$CommonMetricsSnapshot.getPercentile(DescriptiveStatisticsHistogramStatistics.java:159)}}
{{ at org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogramStatistics.getQuantile(DescriptiveStatisticsHistogramStatistics.java:53)}}
{{ at org.apache.flink.runtime.checkpoint.StatsSummarySnapshot.getQuantile(StatsSummarySnapshot.java:108)}}
{{ at org.apache.flink.runtime.rest.messages.checkpoints.StatsSummaryDto.valueOf(StatsSummaryDto.java:81)}}
{{ at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.createCheckpointingStatistics(CheckpointingStatisticsHandler.java:133)}}
{{ at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleCheckpointStatsRequest(CheckpointingStatisticsHandler.java:85)}}
{{ at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleCheckpointStatsRequest(CheckpointingStatisticsHandler.java:59)}}
{{ at org.apache.flink.runtime.rest.handler.job.checkpoints.AbstractCheckpointStatsHandler.lambda$handleRequest$1(AbstractCheckpointStatsHandler.java:62)}}
{{ at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:642)}}
{{ at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)}}
{{ at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)}}
{{ at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)}}
{{ at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)}}
{{ at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)}}
{{ at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)}}
{{ at java.base/java.lang.Thread.run(Thread.java:829)}}

See the graph below for the test results. The dips in the green line correspond to the failures immediately after a job restart.

!https://user-images.githubusercontent.com/35062175/250529297-908a6714-ea15-4aac-a7fc-332589da2582.png!
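
*Possible direction (illustrative sketch only)*

The stack trace suggests the percentile computation is handed a null input array when no checkpoint statistics have been recorded yet after a restart. One way to handle this gracefully might be to guard the quantile lookup and return NaN when there is no data. The class below is a hypothetical, self-contained stand-in: the name QuantileSketch, its fields, and its nearest-rank estimate are assumptions for illustration, not Flink's DescriptiveStatisticsHistogramStatistics, not the commons-math Percentile estimator, and not the actual fix.

```java
import java.util.Arrays;

/**
 * Hypothetical stand-in for a statistics snapshot whose backing array
 * may not be populated yet (for example, right after a job restart).
 */
final class QuantileSketch {

    /** Sorted samples; null models "no data recorded yet". */
    private final double[] sortedValues;

    QuantileSketch(double[] values) {
        if (values == null) {
            this.sortedValues = null;
        } else {
            // Copy before sorting so the caller's array is left untouched.
            this.sortedValues = values.clone();
            Arrays.sort(this.sortedValues);
        }
    }

    /**
     * Returns a nearest-rank quantile, or NaN when there are no samples,
     * instead of throwing on a null or empty input array.
     */
    double getQuantile(double quantile) {
        if (quantile < 0.0 || quantile > 1.0) {
            throw new IllegalArgumentException("quantile must be in [0, 1]: " + quantile);
        }
        if (sortedValues == null || sortedValues.length == 0) {
            return Double.NaN; // graceful fallback: nothing to summarize yet
        }
        int rank = (int) Math.ceil(quantile * sortedValues.length);
        int index = Math.max(0, rank - 1);
        return sortedValues[index];
    }

    public static void main(String[] args) {
        QuantileSketch empty = new QuantileSketch(null);
        QuantileSketch filled = new QuantileSketch(new double[] {4.0, 1.0, 3.0, 2.0});
        System.out.println("empty median:  " + empty.getQuantile(0.5));
        System.out.println("filled median: " + filled.getQuantile(0.5));
    }
}
```

With a guard of this kind, the handler could respond with a placeholder value for the summary statistics rather than an HTTP 500 while the first checkpoint after a restart is still pending. How NaN should be represented in the REST response is left open here.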