tomscut commented on issue #6897: URL: https://github.com/apache/incubator-gluten/issues/6897#issuecomment-2789262533
We hit a similar problem: the Spark executors were typically killed with exit code 134 (SIGABRT). After increasing the executor memory, the data could be read normally.

Environment:
```
Spark 3.4.1 + UniffleShuffleManager
Gluten 1.3.0
JDK 17.0.12
```

The error message is:
```
[2025-04-09 18:50:07.155]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 1101965 Aborted LD_LIBRARY_PATH="/home/shop/platform/hadoop/lib/native:" /apps/svr/jdk-17.0.12/bin/java -server -Xmx450m '-Djava.net.preferIPv6Addresses=false' '-XX:+IgnoreUnrecognizedVMOptions' '--add-opens=java.base/java.lang=ALL-UNNAMED' '--add-opens=java.base/java.lang.invoke=ALL-UNNAMED' '--add-opens=java.base/java.lang.reflect=ALL-UNNAMED' '--add-opens=java.base/java.io=ALL-UNNAMED' '--add-opens=java.base/java.net=ALL-UNNAMED' '--add-opens=java.base/java.nio=ALL-UNNAMED' '--add-opens=java.base/java.util=ALL-UNNAMED' '--add-opens=java.base/java.util.concurrent=ALL-UNNAMED' '--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED' '--add-opens=java.base/sun.nio.ch=ALL-UNNAMED' '--add-opens=java.base/sun.nio.cs=ALL-UNNAMED' '--add-opens=java.base/sun.security.action=ALL-UNNAMED' '--add-opens=java.base/sun.util.calendar=ALL-UNNAMED' '--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED' '-Djdk.reflect.useDirectMethodHandle=false' '-Dio.netty.tryReflectionSetAccessible=true' -Djava.io.tmpdir=/home/shop/hard_disk/0/yarn/local/usercache/hdfs/appcache/application_1744125091376_297834/container_e33_1744125091376_297834_01_000007/tmp '-Dspark.network.timeout=120s' '-Dspark.driver.port=12795' '-Dspark.port.maxRetries=100' '-Dspark.rpc.askTimeout=120s' '-Dspark.rpc.lookupTimeout=120s' '-Dspark.rpc.message.maxSize=256' '-Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/home/shop/hard_disk/0/yarn/logs/application_1744125091376_297834/container_e33_1744125091376_297834_01_000007 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://[email protected]:12795 --executor-id 4 --hostname xx-hadoop-nm-42-228-64.com --cores 4 --app-id application_1744125091376_297834 --resourceProfileId 0 > /home/shop/hard_disk/0/yarn/logs/application_1744125091376_297834/container_e33_1744125091376_297834_01_000007/stdout 2> /home/shop/hard_disk/0/yarn/logs/application_1744125091376_297834/container_e33_1744125091376_297834_01_000007/stderr
Last 4096 bytes of stderr :
unch worker for task 0.3 in stage 5.0 (TID 9)] MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them
25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://[email protected]:12795)
25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 2.0 (TID 8)] TorrentBroadcast: Started reading broadcast variable 4 with 1 pieces (estimated total size 16.0 MiB)
25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] MapOutputTrackerWorker: Got the map output locations
25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 2.0 (TID 8)] MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 34.2 KiB, free 4.1 GiB)
25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 2.0 (TID 8)] TorrentBroadcast: Reading broadcast variable 4 took 31 ms
25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] RssShuffleManager: Get taskId cost 139 ms, and request expected blockIds from 1 tasks for shuffleId[1], partitionId[0, 1024]
25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 2.0 (TID 8)] MemoryStore: Block broadcast_4 stored as values in memory (estimated size 484.7 KiB, free 4.1 GiB)
25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 16.0 MiB)
25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.8 KiB, free 4.1 GiB)
25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] TorrentBroadcast: Reading broadcast variable 3 took 33 ms
25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] MemoryStore: Block broadcast_3 stored as values in memory (estimated size 293.5 KiB, free 4.1 GiB)
25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] RssShuffleManager: Get shuffle blockId cost 1475 ms, and get 1024 blockIds for shuffleId[1], startPartition[0], endPartition[1024]
25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] RssShuffleManager: Shuffle reader using remote storage hdfs://ssdcluster/user/uniffle/bear/shuffle_data,empty conf
25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 2.0 (TID 8)] CodecPool: Got brand-new decompressor [.snappy]
25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 2.0 (TID 8)] CodecPool: Got brand-new decompressor [.snappy]
25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 2.0 (TID 8)] CodecPool: Got brand-new decompressor [.snappy]
25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 2.0 (TID 8)] CodecPool: Got brand-new decompressor [.snappy]
25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] deprecation: mapred.map.output.compression.codec is deprecated. Instead, use mapreduce.map.output.compress.codec
25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] RssShuffleReader: Shuffle read started:appId=application_1744125091376_297834_1744195544475, shuffleId=1,taskId=9_3, partitions: [0, 1024), maps: [0, 2147483647)
25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] BaseAllocator: Debug mode disabled. Enable with the VM option -Darrow.memory.debug.allocator=true.
25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] DefaultAllocationManagerOption: allocation manager type not specified, using netty as the default type
25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 5.0 (TID 9)] CheckAllocator: Using DefaultAllocationManager at memory/DefaultAllocationManagerFactory.class
free(): invalid pointer
.
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2829)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2765)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2764)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2764)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1249)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1249)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1249)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3028)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2967)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2956)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
```
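For anyone hitting the same thing: exit code 134 is 128 + 6, i.e. the process died on SIGABRT, which here came from the native side (`free(): invalid pointer` in the Arrow/Gluten allocator path) rather than a Java OOM. A minimal sketch of the kind of memory increase that worked for us; the concrete values and the `your-app.jar` placeholder are illustrative assumptions, not our exact production settings:

```shell
# Illustrative sketch only -- tune the numbers for your own cluster.
# Because the abort happens in native code, raising the off-heap and
# overhead allocations matters at least as much as raising -Xmx.
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.executor.memoryOverhead=3g \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=4g \
  your-app.jar   # placeholder for the actual application
```

Note that on YARN the container size is the sum of the heap, overhead, and off-heap settings, so all three have to fit inside the container request or the NodeManager will kill the executor anyway.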
