ehurheap opened a new issue, #9480: URL: https://github.com/apache/hudi/issues/9480
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

Creating a savepoint fails with `java.lang.OutOfMemoryError`.

**To Reproduce**

Steps to reproduce the behavior:

1. Hudi MOR table with 23,000 partitions.
2. Metadata table is NOT enabled.
3. In the Hudi CLI, set the Spark configs:

   ```
   hudi:events->set --conf spark.yarn.max.executor.failures=8
   hudi:events->set --conf spark.yarn.am.memory=40g
   hudi:events->set --conf spark.yarn.am.cores=2
   hudi:events->set --conf spark.executor.instances=10
   ```

4. Run `savepoint create --commit 20230809192242950 --sparkMaster yarn --sparkMemory 40G`.
5. After several minutes the command fails with:

   ```
   java.lang.OutOfMemoryError: Java heap space
   Failed: Could not create savepoint "20230809192242950".
   ```

**Expected behavior**

The savepoint should be created.

**Environment Description**

* Hudi version : 0.13.0
* Spark version : 3.3.0
* Hive version : n/a
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

Executor logs show `Issue communicating with driver in heartbeater` and `TimeoutException: Cannot receive any reply from <driver.ip:port> in 10000 milliseconds`.

The `--conf spark.executor.instances=10` setting was ignored - I still only got 2 executors.

Would it help to try this with a Hudi writeClient instead of hudi-cli?

**Stacktrace**

```
23/08/18 17:47:59 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) ~[scala-library-2.12.15.jar:?]
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
	at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1053) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
	at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:238) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2078) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
	at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_382]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_382]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_382]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_382]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263) ~[scala-library-2.12.15.jar:?]
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
	... 13 more
23/08/18 17:48:09 WARN Executor: Issue communicating with driver in heartbeater
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
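As context for the timeouts in the report above: the `Futures timed out after [10000 milliseconds]` messages match the default `spark.executor.heartbeatInterval` of 10s, which is consistent with a driver that stops responding during long GC pauses under heap pressure. One possible workaround while diagnosing (a sketch only; the config keys are standard Spark settings, the values are illustrative, and the `set --conf` syntax mirrors the CLI session in the report) is to relax the RPC timeouts so executors survive long enough to surface the underlying OOM:

```
hudi:events->set --conf spark.executor.heartbeatInterval=60s
hudi:events->set --conf spark.network.timeout=600s
```

Note that `spark.network.timeout` must be set larger than `spark.executor.heartbeatInterval`, per Spark's configuration rules. This does not fix the heap exhaustion itself, which with 23,000 partitions and no metadata table may stem from listing the full partition tree on the driver.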
