kazdy opened a new issue, #5426:
URL: https://github.com/apache/hudi/issues/5426

   **Describe the problem you faced**
   
   I observed something odd in the Hudi metrics (I use the CloudWatch reporter).
   When I stopped the Structured Streaming job, around 100 extra commits suddenly appeared in the metrics (see the picture). The same happened with the pending-compaction metric, which suddenly started showing 178 pending compactions.
   This is the second time I have seen this when stopping the job.
   The timeline has as many commits as expected; the discrepancy is only in the metrics.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Enable cloudwatch metrics reporter
   2. Run a Structured Streaming job with a foreachBatch() sink
   3. Stop the job using `yarn application -kill appid`
   4. Observe additional commits in the metrics
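
   For reference, step 1 boils down to a couple of Hudi writer options. A minimal sketch of how they might be set (the option keys are standard Hudi 0.10 metrics configuration; the table name is hypothetical, and the gist linked under Additional context is the authoritative code):

```python
# Hedged sketch: writer options enabling Hudi's CloudWatch metrics reporter
# (available in Hudi since 0.10). The table name is hypothetical; see the
# linked gist for the actual job configuration.
def cloudwatch_metrics_options(table_name):
    return {
        "hoodie.table.name": table_name,
        "hoodie.table.type": "MERGE_ON_READ",
        # turn metrics on and select the CloudWatch reporter
        "hoodie.metrics.on": "true",
        "hoodie.metrics.reporter.type": "CLOUDWATCH",
    }

options = cloudwatch_metrics_options("mor_streaming")
```

   These options would be passed to the writer inside the foreachBatch() function via `.options(**options)`.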
   
   **Expected behavior**
   
   The metrics reported to CloudWatch should match the table's actual state: the commit count and pending-compaction count should agree with the timeline, and stopping the job should not inflate them.
   
   **Environment Description**
   
   * Hudi version : 0.10.1 OSS
   
   * Spark version : 3.1.2-amzn (EMR on Ec2 with Yarn)
   
   * Hive version : --
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   code:
   https://gist.github.com/kazdy/a3a95aecf0a7dfb6b9ba62a54e9214c9
   spark-submit command:
   ```
   spark-submit \
   --master yarn \
   --deploy-mode cluster \
   --executor-memory 9g \
   --driver-memory 24g \
   --executor-cores 2 \
   --driver-cores 4 \
   --conf "spark.dynamicAllocation.executorIdleTimeout=600" \
   --conf "spark.driver.extraJavaOptions=-XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" \
   --conf "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" \
   --conf "spark.yarn.max.executor.failures=100" \
   --conf "spark.task.maxFailures=4" \
   --conf "spark.rdd.compress=true" \
   --conf "spark.shuffle.compress=true" \
   --conf "spark.shuffle.spill.compress=true" \
   --conf "spark.kryoserializer.buffer.max=512m" \
   --conf "hoodie.upsert.shuffle.parallelism=10" \
   --conf "hoodie.insert.shuffle.parallelism=10" \
   --conf "spark.sql.shuffle.partitions=8" \
   --conf "spark.default.parallelism=8" \
   --conf "spark.driver.maxResultSize=4g" \
   --conf "spark.streaming.stopGracefullyOnShutdown=true" \
   --conf "spark.streaming.backpressure.enabled=true" \
   --conf "spark.driver.memoryOverhead=3000" \
   --conf "spark.executor.memoryOverhead=2048" \
   --packages "org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2" \
   s3://bucket/mor_streaming.py
   ```
   
   **Stacktrace**
   
   N/A. The job does not fail and no error is thrown; only the reported metric values are wrong.
   
   

