Riefu opened a new issue, #2118:
URL: https://github.com/apache/auron/issues/2118

   ## Description
   
   We encountered an issue in production where Auron native execution fails intermittently on a single executor with:
   java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
   
   The failure is isolated to a single executor JVM. Other executors on the same node continue to work normally.
   
   Once triggered, all subsequent tasks scheduled on that executor consistently fail with the same error until the executor or application is restarted.
   
   ---
   
   ## Relevant Logs
   
   Initial failure:
   
   INFO AuronCallNativeWrapper: Initializing native environment (batchSize=10000, memoryFraction=0.6)
   
   INFO Executor: Executor is trying to kill task ..., reason: Stage cancelled
   
   26/03/25 19:01:45 INFO AuronCallNativeWrapper: Initializing native environment (batchSize=10000, memoryFraction=0.6)
   26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 9.0 in stage 8157.0 (TID 52888), reason: Stage cancelled
   26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 49.0 in stage 8157.0 (TID 52928), reason: Stage cancelled
   26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 89.0 in stage 8157.0 (TID 52968), reason: Stage cancelled
   26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 29.0 in stage 8157.0 (TID 52908), reason: Stage cancelled
   26/03/25 19:01:51 INFO Executor: Executor is trying to kill task 69.0 in stage 8157.0 (TID 52948), reason: Stage cancelled
   26/03/25 19:01:51 ERROR Executor: Exception in task 49.0 in stage 8157.0 (TID 52928)
   java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
        at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
        at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
        at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
        at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   26/03/25 19:01:51 ERROR Executor: Exception in task 69.0 in stage 8157.0 (TID 52948)
   java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
        at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
        at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
        at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
        at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   26/03/25 19:01:51 ERROR Executor: Exception in task 89.0 in stage 8157.0 (TID 52968)
   java.lang.ExceptionInInitializerError
        at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
        at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
        at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
        at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.IllegalStateException: error loading native libraries: java.nio.channels.ClosedByInterruptException
        at org.apache.auron.jni.SparkAuronAdaptor.loadAuronLib(SparkAuronAdaptor.java:52)
        at org.apache.auron.jni.AuronCallNativeWrapper.<clinit>(AuronCallNativeWrapper.java:74)
        ... 13 more
   26/03/25 19:01:51 ERROR Executor: Exception in task 9.0 in stage 8157.0 (TID 52888)
   java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
        at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
        at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
        at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
        at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   26/03/25 19:01:51 ERROR Executor: Exception in task 29.0 in stage 8157.0 (TID 52908)
   java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
        at org.apache.spark.sql.auron.NativeHelper$.executeNativePlan(NativeHelper.scala:103)
        at org.apache.spark.sql.execution.auron.shuffle.AuronShuffleWriterBase.nativeShuffleWrite(AuronShuffleWriterBase.scala:67)
        at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.internalWrite(NativeShuffleExchangeExec.scala:187)
        at org.apache.spark.sql.execution.auron.plan.NativeShuffleExchangeExec$$anon$1.write(NativeShuffleExchangeExec.scala:131)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   
   
   Subsequent failures:
   
   java.lang.NoClassDefFoundError: Could not initialize class org.apache.auron.jni.AuronCallNativeWrapper
   
   
   
   ---
   
   ## Observed Behavior
   
   - Issue happens on **only one executor JVM**
   - Other executors on the same node are unaffected
   - Tasks keep failing **only when scheduled to that specific executor**
   - Restarting the application (or executor) resolves the issue
   - Failure occurs during **native shuffle execution path**
   
   ---
   
   ## Root Cause Analysis (Our Understanding)
   
   Based on logs and behavior, we believe the root cause is:
   
   1. Auron native library is **lazily initialized** in the static initializer of `AuronCallNativeWrapper`
   2. Multiple tasks concurrently trigger initialization; one thread runs it while the others wait on the class initialization lock
   3. During initialization, Spark triggers **stage cancellation**, interrupting task threads
   4. Native library loading (likely involving file/channel operations) throws:
      `ClosedByInterruptException`
   5. This causes static initialization (`<clinit>`) to fail with `ExceptionInInitializerError` in the initializing thread
   6. JVM marks the class as **failed initialization**
   7. All other threads and all subsequent usages in the same executor result in:
      `NoClassDefFoundError: Could not initialize class`
   
   This effectively **poisons the entire executor JVM**.
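
To illustrate steps 5 to 7 in isolation, here is a minimal standalone sketch of how the JVM treats a class whose static initializer throws. `FragileNativeWrapper` is a hypothetical stand-in, not Auron code, and the failure is simulated rather than caused by a real interrupt.

```java
class PoisonedClassDemo {
    // Touches the class and reports which Throwable type, if any, the JVM raised.
    static String attempt() {
        try {
            return "loaded=" + FragileNativeWrapper.isLoaded();
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // The first access runs <clinit>, which fails.
        System.out.println("first:  " + attempt());
        // Later accesses never re-run <clinit>; the class stays in the erroneous state.
        System.out.println("second: " + attempt());
    }
}

// Hypothetical stand-in for a wrapper that loads a native library in <clinit>.
class FragileNativeWrapper {
    // Static field initializers run as part of class initialization.
    private static final boolean LOADED = loadLibrary();

    private static boolean loadLibrary() {
        // Simulates the loader failing because the task thread was interrupted.
        throw new IllegalStateException("error loading native libraries (simulated)");
    }

    static boolean isLoaded() {
        return LOADED;
    }
}
```

The first call reports `ExceptionInInitializerError` and every later call reports `NoClassDefFoundError`, which matches the pattern in the logs above: one task sees the real cause, the rest only see the poisoned class.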
   
   ---
   
   ## Impact
   
   - A single executor becomes permanently unusable for Auron tasks
   - Tasks repeatedly fail if scheduled on that executor
   - Requires executor/application restart to recover
   - Can cause prolonged instability (we observed roughly 20 minutes of impact)
   
   ---
   
   ## Expected Behavior
   
   - Native library initialization should be:
     - resilient to task interruption, OR
     - retriable after failure, OR
     - isolated from task lifecycle (not tied to task threads)
   
   - Executor should not enter unrecoverable state due to a transient interrupt
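
As a rough illustration of the "resilient to task interruption" option, the sketch below reads a file through an interruptible `FileChannel` on a thread whose interrupt flag is set. It assumes the real loader does something similar when extracting the library, which the stack trace suggests but does not prove. All class and method names here are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class InterruptSafeLoadDemo {
    // Plain channel read, standing in for extracting a native library.
    static int readAll(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(4096);
            int total = 0;
            int n;
            while ((n = ch.read(buf)) > 0) {
                total += n;
                buf.clear();
            }
            return total;
        }
    }

    // Defers interrupts: clears the flag, retries if an interrupt lands mid-read,
    // and restores the flag afterwards so the task still observes the cancellation.
    static int readAllInterruptSafe(Path file) throws IOException {
        boolean interrupted = Thread.interrupted();
        try {
            while (true) {
                try {
                    return readAll(file);
                } catch (ClosedByInterruptException e) {
                    interrupted = true;
                    Thread.interrupted(); // clear the flag before retrying
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
    }

    static String demo() {
        try {
            Path file = Files.createTempFile("fake-native-lib", ".bin");
            try {
                Files.write(file, new byte[1024]);
                Thread.currentThread().interrupt(); // simulate stage cancellation
                String plain;
                try {
                    plain = "read " + readAll(file);
                } catch (ClosedByInterruptException e) {
                    plain = "ClosedByInterruptException";
                }
                int safe = readAllInterruptSafe(file);
                boolean flagRestored = Thread.interrupted(); // also clears the flag
                return "plain=" + plain + " safe=" + safe + " flagRestored=" + flagRestored;
            } finally {
                Thread.interrupted(); // make sure cleanup runs uninterrupted
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A retry loop like this could spin if a thread were interrupted continuously, so running the load on a dedicated thread that tasks never interrupt may be the more robust variant of the same idea.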
   
   ---
   
   ## Suggestions
   
   1. **Avoid lazy initialization in task execution path**
      - Preload native libraries at executor startup
   
   2. **Make initialization interrupt-safe**
      - Ignore or defer thread interrupts during critical JNI loading
   
   3. **Allow retry after initialization failure**
      - Avoid permanently poisoning class state
   
   4. **Fail fast on executor**
      - If initialization fails, terminate executor instead of leaving it in broken state
   
   5. **Reduce lock contention during initialization**
      - Ensure only one thread performs initialization without exposing others to failure
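
Suggestions 3 and 5 could be combined by moving the load out of `<clinit>` into an explicit, synchronized holder. The following is only a sketch of that pattern under the assumption that callers can go through an accessor instead of relying on class initialization; `RetriableInit` and the simulated loader are hypothetical names, not part of Auron.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

class RetriableInitDemo {
    // Lazy holder whose failure is not sticky: a failed load leaves it uninitialized.
    static final class RetriableInit<T> {
        private final Callable<T> loader;
        private T value;        // guarded by this
        private boolean loaded; // guarded by this

        RetriableInit(Callable<T> loader) {
            this.loader = loader;
        }

        // Only one thread runs the loader at a time. If it throws, `loaded`
        // stays false, so the next caller retries instead of inheriting the failure.
        synchronized T get() throws Exception {
            if (!loaded) {
                value = loader.call();
                loaded = true;
            }
            return value;
        }
    }

    static String demo() {
        AtomicInteger calls = new AtomicInteger();
        // Simulated loader: fails on the first call (as if interrupted), then succeeds.
        RetriableInit<String> lib = new RetriableInit<>(() -> {
            if (calls.incrementAndGet() == 1) {
                throw new IllegalStateException("interrupted (simulated)");
            }
            return "loaded";
        });

        String first;
        try {
            first = lib.get();
        } catch (Exception e) {
            first = "failed";
        }
        String second;
        try {
            second = lib.get();
        } catch (Exception e) {
            second = "failed";
        }
        return "first=" + first + " second=" + second + " calls=" + calls.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // first=failed second=loaded calls=2
    }
}
```

Unlike a static initializer, a failure here only affects the task that hit it; a later task on the same executor gets a fresh attempt.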
   
   ---
   
   
   

