ever4Kenny opened a new issue, #3579:
URL: https://github.com/apache/celeborn/issues/3579
### What is the bug(with logs or screenshots)?
```
26/01/05 16:30:15 ERROR ExecutorClassLoader: Failed to check existence of class org.apache.spark.shuffle.celeborn.ColumnarHashBasedShuffleWriter on REPL class server at spark://dc05-prod-lan-hadoop-host-168159.host.idcvdian.com:24503/classes
java.lang.InterruptedException: AbstractBootstrap$PendingRegistrationPromise@225ee7c4(incomplete)
    at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:684)
    at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:300)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:289)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:226)
    at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:399)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:367)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1397)
    at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:366)
    at org.apache.spark.executor.ExecutorClassLoader.getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:135)
    at org.apache.spark.executor.ExecutorClassLoader.$anonfun$fetchFn$1(ExecutorClassLoader.scala:66)
    at org.apache.spark.executor.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:176)
    at org.apache.spark.executor.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:113)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.celeborn.reflect.DynConstructors$Builder.impl(DynConstructors.java:158)
    at org.apache.spark.shuffle.celeborn.SparkUtils.<clinit>(SparkUtils.java:191)
    at org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO.<init>(CelebornShuffleDataIO.java:41)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.util.Utils$.$anonfun$loadExtensions$1(Utils.scala:2780)
    at scala.collection.immutable.List.flatMap(List.scala:366)
    at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2772)
    at org.apache.spark.shuffle.ShuffleDataIOUtils$.loadShuffleDataIO(ShuffleDataIOUtils.scala:35)
    at org.apache.spark.shuffle.sort.SortShuffleManager$.org$apache$spark$shuffle$sort$SortShuffleManager$$loadShuffleExecutorComponents(SortShuffleManager.scala:253)
    at org.apache.spark.shuffle.sort.SortShuffleManager.shuffleExecutorComponents$lzycompute(SortShuffleManager.scala:88)
    at org.apache.spark.shuffle.sort.SortShuffleManager.shuffleExecutorComponents(SortShuffleManager.scala:88)
    at org.apache.spark.shuffle.sort.SortShuffleManager.getWriter(SortShuffleManager.scala:170)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:57)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:621)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:624)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
26/01/05 16:30:15 ERROR Executor: Exception in task 26.1 in stage 1.0 (TID 147)
java.lang.ExceptionInInitializerError
    at org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO.<init>(CelebornShuffleDataIO.java:41)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.util.Utils$.$anonfun$loadExtensions$1(Utils.scala:2780)
    at scala.collection.immutable.List.flatMap(List.scala:366)
    at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2772)
    at org.apache.spark.shuffle.ShuffleDataIOUtils$.loadShuffleDataIO(ShuffleDataIOUtils.scala:35)
    at org.apache.spark.shuffle.sort.SortShuffleManager$.org$apache$spark$shuffle$sort$SortShuffleManager$$loadShuffleExecutorComponents(SortShuffleManager.scala:253)
    at org.apache.spark.shuffle.sort.SortShuffleManager.shuffleExecutorComponents$lzycompute(SortShuffleManager.scala:88)
    at org.apache.spark.shuffle.sort.SortShuffleManager.shuffleExecutorComponents(SortShuffleManager.scala:88)
    at org.apache.spark.shuffle.sort.SortShuffleManager.getWriter(SortShuffleManager.scala:170)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:57)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:621)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:624)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException: AbstractBootstrap$PendingRegistrationPromise@225ee7c4(incomplete)
    at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:684)
    at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:300)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:289)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:226)
    at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:399)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:367)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1397)
    at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:366)
    at org.apache.spark.executor.ExecutorClassLoader.getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:135)
    at org.apache.spark.executor.ExecutorClassLoader.$anonfun$fetchFn$1(ExecutorClassLoader.scala:66)
    at org.apache.spark.executor.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:176)
    at org.apache.spark.executor.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:113)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.celeborn.reflect.DynConstructors$Builder.impl(DynConstructors.java:158)
    at org.apache.spark.shuffle.celeborn.SparkUtils.<clinit>(SparkUtils.java:191)
    ... 26 more
26/01/05 16:30:16 INFO CelebornShuffleDataIO: Loading CelebornShuffleDataIO
26/01/05 16:30:16 ERROR Executor: Exception in task 26.0 in stage 2.0 (TID 237)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.shuffle.celeborn.SparkUtils
    at org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO.<init>(CelebornShuffleDataIO.java:41)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.util.Utils$.$anonfun$loadExtensions$1(Utils.scala:2780)
    at scala.collection.immutable.List.flatMap(List.scala:366)
    at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2772)
    at org.apache.spark.shuffle.ShuffleDataIOUtils$.loadShuffleDataIO(ShuffleDataIOUtils.scala:35)
    at org.apache.spark.shuffle.sort.SortShuffleManager$.org$apache$spark$shuffle$sort$SortShuffleManager$$loadShuffleExecutorComponents(SortShuffleManager.scala:253)
    at org.apache.spark.shuffle.sort.SortShuffleManager.shuffleExecutorComponents$lzycompute(SortShuffleManager.scala:88)
    at org.apache.spark.shuffle.sort.SortShuffleManager.shuffleExecutorComponents(SortShuffleManager.scala:88)
    at org.apache.spark.shuffle.sort.SortShuffleManager.getWriter(SortShuffleManager.scala:170)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:57)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:621)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:624)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
### How to reproduce the bug?
On Spark 3.5, when we enable Celeborn with `spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO`, executors fail to start because `SparkUtils` fails to initialize, even though we never use columnar shuffle.
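For reference, the enabling configuration looks roughly like the sketch below. This is a hedged example, not copied from our job: the master endpoint is a placeholder, and only the `spark.shuffle.sort.io.plugin.class` line is the setting that triggers the failure.

```shell
# Minimal sketch of the Spark 3.5 + Celeborn setup that hits this bug.
# <celeborn-master> is a placeholder for your Celeborn master host.
spark-shell \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
  --conf spark.celeborn.master.endpoints=<celeborn-master>:9097 \
  --conf spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO
```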
That's because `SparkUtils` statically loads `org.apache.spark.shuffle.celeborn.ColumnarHashBasedShuffleWriter`, which is not included in the shaded Spark client jar.
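This also explains the two different errors in the log above. When a class's static initializer throws, the JVM marks the class as erroneous: the first use fails with `ExceptionInInitializerError` (stage 1.0), and every later use fails with `NoClassDefFoundError: Could not initialize class ...` (stage 2.0). A minimal, self-contained demonstration of that JVM behavior (the inner class stands in for `SparkUtils`; the thrown exception simulates the missing columnar writer class):

```java
public class StaticInitFailureDemo {
    /** Stand-in for SparkUtils: its static initializer fails. */
    static class BrokenUtils {
        static {
            // Simulates the Class.forName of the missing columnar writer class.
            if (true) throw new RuntimeException("columnar writer not on classpath");
        }
        static void use() {}
    }

    /** Attempts to use the class and reports which error escaped. */
    static String tryUse() {
        try {
            BrokenUtils.use();
            return "ok";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // First active use: the static initializer runs and fails.
        System.out.println(tryUse()); // ExceptionInInitializerError
        // The class is now marked erroneous; later uses fail differently.
        System.out.println(tryUse()); // NoClassDefFoundError
    }
}
```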
To work around this, we had to:
1. Manually build the [spark-3.5-columnar-shuffle](https://github.com/apache/celeborn/tree/main/client-spark/spark-3.5-columnar-shuffle) module with mvn, and place the resulting jar on the Spark classpath.
2. Apply assets/spark-patch/Celeborn-Optimize-Skew-Partitions-spark3_5_6.patch to Spark.
However, this is unreasonable: we never intended to use columnar shuffle and left `celeborn.columnarShuffle.enabled=false`. Moreover, no documentation mentions this requirement, so any Spark 3.5 user new to this project will stumble into this issue and have their Spark jobs fail.
Ideally, the columnar classes should be loaded on demand, only when columnar shuffle is enabled, and the behavior should be clearly documented.
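One possible shape of the on-demand loading, sketched below. This is not Celeborn's actual code: the class and method names (`LazyColumnarWriterLoader`, `columnarWriterCtor`, `columnarAvailable`) are hypothetical, and only the fully-qualified writer class name comes from the stack trace. The key point is that the reflective lookup happens lazily and only behind the columnar-shuffle flag, so a missing class can never poison `SparkUtils.<clinit>`.

```java
import java.lang.reflect.Constructor;

public class LazyColumnarWriterLoader {
    // Class name taken from the issue's stack trace.
    static final String COLUMNAR_WRITER_CLASS =
        "org.apache.spark.shuffle.celeborn.ColumnarHashBasedShuffleWriter";

    private static volatile Constructor<?> cached;

    /** Resolves the columnar writer constructor lazily, on first use. */
    static Constructor<?> columnarWriterCtor() throws ReflectiveOperationException {
        Constructor<?> c = cached;
        if (c == null) {
            // Runs on the first columnar write, not at class load time.
            Class<?> cls = Class.forName(COLUMNAR_WRITER_CLASS);
            c = cls.getDeclaredConstructors()[0];
            cached = c;
        }
        return c;
    }

    /** True only when columnar shuffle is both enabled and actually loadable. */
    static boolean columnarAvailable(boolean columnarEnabled) {
        if (!columnarEnabled) {
            return false; // never touch the class when the feature is off
        }
        try {
            columnarWriterCtor();
            return true;
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class absent from the shaded jar: fall back to row-based shuffle.
            return false;
        }
    }
}
```

With this shape, a shaded jar without the columnar module simply reports the feature as unavailable instead of failing every executor task.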