jvaesteves opened a new issue #1599:
URL: https://github.com/apache/incubator-hudi/issues/1599
Hello everyone, I am currently benchmarking Hudi against other solutions for
Spark Streaming, in terms of time taken and bytes used. For the other solutions, I am
extracting **durationMs** and **stateOperators** from the MicroBatchExecution log,
but for Hudi, **stateOperators** comes back empty.
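For context, the same fields can also be read programmatically. A minimal sketch of pulling them per micro-batch through a `StreamingQueryListener` (this is illustrative, not my actual log-parsing code; `spark` is the active `SparkSession`):
```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Print durationMs and stateOperators for every finished micro-batch.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"durationMs=${p.durationMs}")
    println(s"stateOperators=${p.stateOperators.mkString("[", ", ", "]")}")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})
```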
I read in the documentation about the integration with Graphite, so I set the
parameters for Hudi to send metrics to the same location as the Spark metrics
(which worked), trying both localhost and an external IP. From the second
micro-batch onwards, I started getting errors like:
```
20/05/06 11:49:25 ERROR Metrics: Failed to send metrics:
java.lang.IllegalArgumentException: A metric named kafka_hudi.finalize.duration already exists
    at org.apache.hudi.com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:97)
    at org.apache.hudi.metrics.Metrics.registerGauge(Metrics.java:83)
    at org.apache.hudi.metrics.HoodieMetrics.updateFinalizeWriteMetrics(HoodieMetrics.java:177)
    at org.apache.hudi.HoodieWriteClient.lambda$finalizeWrite$14(HoodieWriteClient.java:1233)
    at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
    at org.apache.hudi.HoodieWriteClient.finalizeWrite(HoodieWriteClient.java:1231)
    at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:497)
    at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:479)
    at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:470)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:152)
    at org.apache.hudi.HoodieStreamingSink$$anonfun$1$$anonfun$2.apply(HoodieStreamingSink.scala:51)
    at org.apache.hudi.HoodieStreamingSink$$anonfun$1$$anonfun$2.apply(HoodieStreamingSink.scala:51)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.hudi.HoodieStreamingSink$$anonfun$1.apply(HoodieStreamingSink.scala:50)
    at org.apache.hudi.HoodieStreamingSink$$anonfun$1.apply(HoodieStreamingSink.scala:50)
    at org.apache.hudi.HoodieStreamingSink.retry(HoodieStreamingSink.scala:114)
    at org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:49)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
```
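From the trace, the shaded Codahale registry appears to reject a second registration under a name that already exists, which would explain why the first micro-batch succeeds and later ones fail. A standalone sketch of that behaviour, using the unshaded library:
```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

val registry = new MetricRegistry()
val name = "kafka_hudi.finalize.duration"

// First registration succeeds.
registry.register(name, new Gauge[Long] { override def getValue: Long = 0L })

// Second registration under the same name throws:
// java.lang.IllegalArgumentException: A metric named
// kafka_hudi.finalize.duration already exists
registry.register(name, new Gauge[Long] { override def getValue: Long = 1L })
```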
Even though it appears that the metrics were successfully sent at least once, I
was never able to retrieve that data. Also, for the external IP, I successfully
opened a connection to the address using telnet, so I know it is not a
security group issue.
So, what I want to know is: does Hudi push metrics to an endpoint, like Spark
does, or does it just expose those metrics for another system to pull? And what
could I have done wrong in this application?
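To clarify what I mean by "push": a minimal standalone sketch of the push model a Codahale Graphite reporter uses (this is the plain Codahale API, not Hudi's internal wiring; the host, port, and interval are just the values from my setup):
```scala
import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit

import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}

// A push-style reporter: metrics are written to the Graphite host over TCP
// on a fixed schedule, rather than scraped by the metrics server.
val registry = new MetricRegistry()
val graphite = new Graphite(new InetSocketAddress("10.115.52.63", 32683))
val reporter = GraphiteReporter.forRegistry(registry)
  .prefixedWith("hudi")
  .build(graphite)
reporter.start(30, TimeUnit.SECONDS) // arbitrary 30-second push interval
```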
Here is my config object for Hudi:
```scala
// Imports assumed for the config keys below (Hudi 0.5.0 packages).
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.{HoodieCompactionConfig, HoodieMetricsConfig, HoodieWriteConfig}

val hudiOptions = Map[String, String](
  HoodieWriteConfig.TABLE_NAME -> hudiTableName,
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "key",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "dt",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "timestamp",
  DataSourceWriteOptions.OPERATION_OPT_KEY -> "insert",
  DataSourceWriteOptions.INSERT_DROP_DUPS_OPT_KEY -> "true",
  HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP -> "1",
  HoodieMetricsConfig.METRICS_ON -> "true",
  HoodieMetricsConfig.METRICS_REPORTER_TYPE -> "GRAPHITE",
  HoodieMetricsConfig.GRAPHITE_SERVER_HOST -> "10.115.52.63",
  HoodieMetricsConfig.GRAPHITE_SERVER_PORT -> "32683",
  HoodieMetricsConfig.GRAPHITE_METRIC_PREFIX -> "hudi"
)
```
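The options are applied to the streaming write roughly like this (the DataFrame, bucket paths, and checkpoint location here are placeholders, not my exact job):
```scala
// Sketch of the streaming write that triggers HoodieStreamingSink.addBatch.
df.writeStream
  .format("org.apache.hudi")
  .options(hudiOptions)
  .option("checkpointLocation", s"s3://my-bucket/checkpoints/$hudiTableName")
  .outputMode("append")
  .start(s"s3://my-bucket/tables/$hudiTableName")
```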
**Environment Description**
- Hudi version: 0.5.0
- Spark version : 2.4.4
- Hive version : 2.3.6
- Hadoop version : Amazon 2.8.5 (emr-5.29.0)
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : No