tomscut commented on issue #6897:
URL: 
https://github.com/apache/incubator-gluten/issues/6897#issuecomment-2789262533

   We hit a similar problem: the Spark jobs typically died with exit code 134. After we increased executor memory, the data could be read normally.
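   For context, YARN reports a container killed by a signal as exit code 128 plus the signal number, so 134 maps to signal 6 (SIGABRT), which matches a native-side abort like the `free(): invalid pointer` in the log below. A quick sanity check of that mapping:
   ```shell
   # YARN exit codes > 128 encode the fatal signal: code = 128 + signal number.
   echo $((134 - 128))   # prints 6
   kill -l 6             # prints ABRT (SIGABRT, raised by glibc on heap corruption)
   ```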
   
   Environment:
   ```
   Spark-3.4.1 + UniffleShuffleManager
   gluten-1.3.0 
   jdk-17.0.12
   ```
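   For anyone hitting the same abort: the fix on our side was raising the memory available to the native side. A sketch of the kind of settings involved, with illustrative values rather than our exact production numbers (Gluten's Velox backend allocates from Spark's off-heap pool, so the off-heap settings matter most):
   ```
   # spark-defaults.conf (illustrative values, tune to your workload)
   spark.memory.offHeap.enabled      true
   spark.memory.offHeap.size         4g    # native memory used by Gluten/Velox
   spark.executor.memory             4g
   spark.executor.memoryOverhead     2g    # headroom for JVM + native allocations
   ```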
   
   The error message is:
   ```
   [2025-04-09 18:50:07.155]Container exited with a non-zero exit code 134. 
Error file: prelaunch.err.
   Last 4096 bytes of prelaunch.err :
   /bin/bash: line 1: 1101965 Aborted                 
LD_LIBRARY_PATH="/home/shop/platform/hadoop/lib/native:" 
/apps/svr/jdk-17.0.12/bin/java -server -Xmx450m 
'-Djava.net.preferIPv6Addresses=false' '-XX:+IgnoreUnrecognizedVMOptions' 
'--add-opens=java.base/java.lang=ALL-UNNAMED' 
'--add-opens=java.base/java.lang.invoke=ALL-UNNAMED' 
'--add-opens=java.base/java.lang.reflect=ALL-UNNAMED' 
'--add-opens=java.base/java.io=ALL-UNNAMED' 
'--add-opens=java.base/java.net=ALL-UNNAMED' 
'--add-opens=java.base/java.nio=ALL-UNNAMED' 
'--add-opens=java.base/java.util=ALL-UNNAMED' 
'--add-opens=java.base/java.util.concurrent=ALL-UNNAMED' 
'--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED' 
'--add-opens=java.base/sun.nio.ch=ALL-UNNAMED' 
'--add-opens=java.base/sun.nio.cs=ALL-UNNAMED' 
'--add-opens=java.base/sun.security.action=ALL-UNNAMED' 
'--add-opens=java.base/sun.util.calendar=ALL-UNNAMED' 
'--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED' 
'-Djdk.reflect.useDirectMethodHandle=false' 
 '-Dio.netty.tryReflectionSetAccessible=true' 
-Djava.io.tmpdir=/home/shop/hard_disk/0/yarn/local/usercache/hdfs/appcache/application_1744125091376_297834/container_e33_1744125091376_297834_01_000007/tmp
 '-Dspark.network.timeout=120s' '-Dspark.driver.port=12795' 
'-Dspark.port.maxRetries=100' '-Dspark.rpc.askTimeout=120s' 
'-Dspark.rpc.lookupTimeout=120s' '-Dspark.rpc.message.maxSize=256' 
'-Dspark.ui.port=0' 
-Dspark.yarn.app.container.log.dir=/home/shop/hard_disk/0/yarn/logs/application_1744125091376_297834/container_e33_1744125091376_297834_01_000007
 -XX:OnOutOfMemoryError='kill %p' 
org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url 
spark://[email protected]:12795 --executor-id 4 --hostname 
xx-hadoop-nm-42-228-64.com --cores 4 --app-id application_1744125091376_297834 
--resourceProfileId 0 > 
/home/shop/hard_disk/0/yarn/logs/application_1744125091376_297834/container_e33_1744125091376_297834_01_000007/stdout
 2> /home/shop/hard_disk/0/yarn/logs/application_1744125091376_297834/container_e33_1744125091376_297834_01_000007/stderr
   Last 4096 bytes of stderr :
   unch worker for task 0.3 in stage 5.0 (TID 9)] MapOutputTrackerWorker: Don't 
have map outputs for shuffle 1, fetching them
   25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] MapOutputTrackerWorker: Doing the fetch; tracker endpoint = 
NettyRpcEndpointRef(spark://[email protected]:12795)
   25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 
2.0 (TID 8)] TorrentBroadcast: Started reading broadcast variable 4 with 1 
pieces (estimated total size 16.0 MiB)
   25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] MapOutputTrackerWorker: Got the map output locations
   25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 
2.0 (TID 8)] MemoryStore: Block broadcast_4_piece0 stored as bytes in memory 
(estimated size 34.2 KiB, free 4.1 GiB)
   25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 
2.0 (TID 8)] TorrentBroadcast: Reading broadcast variable 4 took 31 ms
   25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] RssShuffleManager: Get taskId cost 139 ms, and request expected 
blockIds from 1 tasks for shuffleId[1], partitionId[0, 1024]
   25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 
2.0 (TID 8)] MemoryStore: Block broadcast_4 stored as values in memory 
(estimated size 484.7 KiB, free 4.1 GiB)
   25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] TorrentBroadcast: Started reading broadcast variable 3 with 1 
pieces (estimated total size 16.0 MiB)
   25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] MemoryStore: Block broadcast_3_piece0 stored as bytes in memory 
(estimated size 2.8 KiB, free 4.1 GiB)
   25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] TorrentBroadcast: Reading broadcast variable 3 took 33 ms
   25/04/09 18:50:04 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] MemoryStore: Block broadcast_3 stored as values in memory 
(estimated size 293.5 KiB, free 4.1 GiB)
   25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] RssShuffleManager: Get shuffle blockId cost 1475 ms, and get 1024 
blockIds for shuffleId[1], startPartition[0], endPartition[1024]
   25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] RssShuffleManager: Shuffle reader using remote storage 
hdfs://ssdcluster/user/uniffle/bear/shuffle_data,empty conf
   25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 
2.0 (TID 8)] CodecPool: Got brand-new decompressor [.snappy]
   25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 
2.0 (TID 8)] CodecPool: Got brand-new decompressor [.snappy]
   25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 
2.0 (TID 8)] CodecPool: Got brand-new decompressor [.snappy]
   25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 
2.0 (TID 8)] CodecPool: Got brand-new decompressor [.snappy]
   25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] deprecation: mapred.map.output.compression.codec is deprecated. 
Instead, use mapreduce.map.output.compress.codec
   25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] RssShuffleReader: Shuffle read 
started:appId=application_1744125091376_297834_1744195544475, 
shuffleId=1,taskId=9_3, partitions: [0, 1024), maps: [0, 2147483647)
   25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] BaseAllocator: Debug mode disabled. Enable with the VM option 
-Darrow.memory.debug.allocator=true.
   25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] DefaultAllocationManagerOption: allocation manager type not 
specified, using netty as the default type
   25/04/09 18:50:06 INFO [Executor task launch worker for task 0.3 in stage 
5.0 (TID 9)] CheckAllocator: Using DefaultAllocationManager at 
memory/DefaultAllocationManagerFactory.class
   free(): invalid pointer
   
   .
   Driver stacktrace:
        at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2829)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2765)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2764)
        at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2764)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1249)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1249)
        at scala.Option.foreach(Option.scala:407)
        at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1249)
        at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3028)
        at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2967)
        at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2956)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

