Akira Ajisaka created HUDI-7107:
-----------------------------------
Summary: Reused MetricsReporter fails to publish metrics in Spark
streaming job
Key: HUDI-7107
URL: https://issues.apache.org/jira/browse/HUDI-7107
Project: Apache Hudi
Issue Type: Bug
Components: metrics
Reporter: Akira Ajisaka
A customer runs an AWS Glue 4.0 streaming job (based on Spark 3.3.0) using Apache
Hudi 0.14.0 libraries. The customer enabled the [Hudi CloudWatch metrics
reporter|https://hudi.apache.org/docs/metrics/#aws-cloudwatchreporter] in the
job. The reporter published metrics successfully for the first batch, but every
later attempt failed. As a result, only one data sample was published to
CloudWatch metrics. The error stack trace is as follows:
{noformat}
2023-11-09 15:59:17,775 ERROR [stream execution thread for [id =
d31c62e2-e697-40b7-b6da-854cf9a8cb14, runId =
0f2325c6-83f4-4dfa-a849-3b3189423a9b]] cloudwatch.CloudWatchReporter
(CloudWatchReporter.java:report(236)): Error reporting metrics to CloudWatch.
The data in this CloudWatch request may have been discarded, and not made it to
CloudWatch.
java.util.concurrent.ExecutionException:
org.apache.hudi.software.amazon.awssdk.core.exception.SdkClientException:
Unable to execute HTTP request: event executor terminated
at
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
~[?:1.8.0_382]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
~[?:1.8.0_382]
at
org.apache.hudi.aws.cloudwatch.CloudWatchReporter.report(CloudWatchReporter.java:234)
~[hudi-aws-bundle-0.14.0.jar:0.14.0]
at
org.apache.hudi.aws.cloudwatch.CloudWatchReporter.report(CloudWatchReporter.java:211)
~[hudi-aws-bundle-0.14.0.jar:0.14.0]
at
org.apache.hudi.com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:237)
~[hudi-aws-bundle-0.14.0.jar:0.14.0]
at
org.apache.hudi.metrics.cloudwatch.CloudWatchMetricsReporter.report(CloudWatchMetricsReporter.java:71)
~[hudi-utilities-bundle_2.12-0.14.0.jar:0.14.0]
at java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_382]
at org.apache.hudi.metrics.Metrics.shutdown(Metrics.java:116)
~[hudi-utilities-bundle_2.12-0.14.0.jar:0.14.0]
at java.util.HashMap$Values.forEach(HashMap.java:982) ~[?:1.8.0_382]
at org.apache.hudi.metrics.Metrics.shutdownAllMetrics(Metrics.java:88)
~[hudi-utilities-bundle_2.12-0.14.0.jar:0.14.0]
at
org.apache.hudi.HoodieSparkSqlWriter$.cleanup(HoodieSparkSqlWriter.scala:937)
~[hudi-utilities-bundle_2.12-0.14.0.jar:0.14.0]
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:151)
~[hudi-utilities-bundle_2.12-0.14.0.jar:0.14.0]
{noformat}
This error comes from the AWS SDK for Java v2, whose Netty event executor has
already been terminated:
{noformat}
Caused by: java.util.concurrent.RejectedExecutionException: event executor
terminated
at
io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:923)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:350)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:343)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:825)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.concurrent.SingleThreadEventExecutor.lazyExecute(SingleThreadEventExecutor.java:820)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.concurrent.AbstractScheduledEventExecutor.schedule(AbstractScheduledEventExecutor.java:263)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.concurrent.AbstractScheduledEventExecutor.schedule(AbstractScheduledEventExecutor.java:177)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.concurrent.AbstractEventExecutorGroup.schedule(AbstractEventExecutorGroup.java:50)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at
org.apache.hudi.software.amazon.awssdk.http.nio.netty.internal.DelegatingEventLoopGroup.schedule(DelegatingEventLoopGroup.java:153)
~[hudi-aws-bundle-0.14.0.jar:0.14.0]
{noformat}
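I've observed that the MetricsReporter is shut down after the first batch (the
stack trace shows Metrics.shutdownAllMetrics being called from
HoodieSparkSqlWriter.cleanup), but the same MetricsReporter instance is reused
in subsequent batches, where it fails to report metrics.
Given that MetricsReporter is not implemented to be reusable after shutdown, we
should avoid reusing instances.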
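
The failure mode can be reproduced outside Hudi with a minimal sketch. Everything in it is hypothetical: SketchReporter and ReporterHolder are illustrative names rather than Hudi classes, and a plain ExecutorService stands in for the SDK's Netty event loop. It only shows that a reporter whose executor has been shut down rejects further work, and that one possible remedy is to create a fresh reporter instead of reusing the closed one. It is not the actual fix.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.Supplier;

public class ReporterReuseSketch {

    /** Hypothetical stand-in for a reporter that owns a non-restartable executor. */
    public static final class SketchReporter implements AutoCloseable {
        private final ExecutorService executor = Executors.newSingleThreadExecutor();

        /** Submits one publish task and waits for it; rejected once the executor is shut down. */
        public void report(String batch) throws Exception {
            executor.submit(() -> System.out.println("published " + batch)).get();
        }

        public boolean isClosed() {
            return executor.isShutdown();
        }

        @Override
        public void close() {
            executor.shutdown();
        }
    }

    /** Hypothetical holder that replaces the reporter once it has been closed. */
    public static final class ReporterHolder {
        private final Supplier<SketchReporter> factory;
        private SketchReporter current;

        public ReporterHolder(Supplier<SketchReporter> factory) {
            this.factory = factory;
        }

        public synchronized SketchReporter get() {
            if (current == null || current.isClosed()) {
                current = factory.get();
            }
            return current;
        }
    }

    /** Mirrors the reported bug: report, shut down, then reuse the same instance. */
    public static boolean reuseAfterCloseFails() throws Exception {
        SketchReporter reporter = new SketchReporter();
        reporter.report("batch-1");
        reporter.close();
        try {
            reporter.report("batch-2");
            return false;
        } catch (RejectedExecutionException e) {
            return true; // same family of error as "event executor terminated"
        }
    }

    /** Avoids reuse: each batch asks the holder, which recreates a closed reporter. */
    public static boolean recreateSucceeds() throws Exception {
        ReporterHolder holder = new ReporterHolder(SketchReporter::new);
        SketchReporter first = holder.get();
        first.report("batch-1");
        first.close();
        SketchReporter second = holder.get();
        second.report("batch-2");
        second.close();
        return first != second;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("reuse after close fails: " + reuseAfterCloseFails());
        System.out.println("recreate succeeds: " + recreateSucceeds());
    }
}
```

Whether Hudi should recreate the reporter per batch or skip the shutdown between batches is a separate design question that this sketch does not settle.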