lifulong commented on issue #10436:
URL: https://github.com/apache/incubator-gluten/issues/10436#issuecomment-3187497092
ExecutorLostFailure (executor 1080 exited caused by one of the running
tasks) Reason: Container from a bad node:
container_e200059_1752315269200_3167805_01_001589 on host:
kdata1685.tsn01.rack.zhihu.com. Exit status: 134. Diagnostics:
[2025-08-14 14:13:02.202]Container exited with a non-zero exit code 134.
Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 3576580 Aborted (core dumped)
LD_LIBRARY_PATH="/usr/lib/hadoop/lzo/lib:/usr/lib/hadoop/lib/native:"
/usr/java/jdk/bin/java -server -Xmx2048m '-Djava.net.preferIPv6Addresses=false'
'-XX:+IgnoreUnrecognizedVMOptions'
'--add-opens=java.base/java.lang=ALL-UNNAMED'
'--add-opens=java.base/java.lang.invoke=ALL-UNNAMED'
'--add-opens=java.base/java.lang.reflect=ALL-UNNAMED'
'--add-opens=java.base/java.io=ALL-UNNAMED'
'--add-opens=java.base/java.net=ALL-UNNAMED'
'--add-opens=java.base/java.nio=ALL-UNNAMED'
'--add-opens=java.base/java.util=ALL-UNNAMED'
'--add-opens=java.base/java.util.concurrent=ALL-UNNAMED'
'--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED'
'--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED'
'--add-opens=java.base/sun.nio.ch=ALL-UNNAMED'
'--add-opens=java.base/sun.nio.cs=ALL-UNNAMED'
'--add-opens=java.base/sun.security.action=ALL-UNNAMED'
'--add-opens=java.base/sun.util.calendar=ALL-UNNAMED'
'--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED'
'-Djdk.reflect.useDirectMethodHandle=false'
'-Xss4m'
'-javaagent:jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.HTTPOutputReporter,httpServer=http://profiler.tsn01.in.zhihu.com/v1/create_memory_stats,httpServerToken=3747ad16a05ee96be1a38a9ecd7bfbc1,enableSubtree=true'
-Djava.io.tmpdir=/data4/data/hadoop/yarn/local/usercache/tc_warehouse/appcache/application_1752315269200_3167805/container_e200059_1752315269200_3167805_01_001589/tmp
'-Dspark.kyuubi.metrics.prometheus.port=9598' '-Dspark.network.timeout=3600s'
'-Dspark.rpc.askTimeout=120s' '-Dspark.driver.port=39091'
'-Dspark.rpc.lookupTimeout=120s' '-Dspark.ui.port=0'
-Dspark.yarn.app.container.log.dir=/data1/data/hadoop/yarn/logs/application_1752315269200_3167805/container_e200059_1752315269200_3167805_01_001589
-XX:OnOutOfMemoryError='kill %p'
org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url
spark://[email protected]:39091 --executo
r-id 1080 --hostname kdata1685.tsn01.rack.zhihu.com --cores 4 --app-id
application_1752315269200_3167805 --resourceProfileId 0 >
/data1/data/hadoop/yarn/logs/application_1752315269200_3167805/container_e200059_1752315269200_3167805_01_001589/stdout
2>
/data1/data/hadoop/yarn/logs/application_1752315269200_3167805/container_e200059_1752315269200_3167805_01_001589/stderr
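
A note on the exit status: 134 is 128 + 6, i.e. the process was terminated
by SIGABRT, which matches the "Aborted (core dumped)" line in prelaunch.err
above. That usually indicates the abort came from native code (here the
Gluten/Velox JNI side) or a fatal JVM error, not an ordinary Java exception.
The signal arithmetic can be checked from any shell:

    # Exit codes above 128 mean "terminated by signal (code - 128)".
    echo $((134 - 128))   # 6
    kill -l 6             # ABRT, i.e. SIGABRT (abort(), core dumped)
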
Last 4096 bytes of stderr :
lity":"NULLABILITY_NULLABLE"}},"arguments":[{"value":{"cast":{"type":{"date":{"nullability":"NULLABILITY_NULLABLE"}},"input":{"selection":{"directReference":{"structField":{"field":2}}}},"failureBehavior":"FAILURE_BEHAVIOR_RETURN_NULL"}}},{"value":{"cast":{"type":{"date":{"nullability":"NULLABILITY_NULLABLE"}},"input":{"selection":{"directReference":{"structField":{}}}},"failureBehavior":"FAILURE_BEHAVIOR_RETURN_NULL"}}}]}},"failureBehavior":"FAILURE_BEHAVIOR_RETURN_NULL"}},{"literal":{"string":"2025-08-13"}}]}}}}}}]}
I20250814 14:12:01.688426 3618619 VeloxRuntime.cc:152] VeloxRuntime session
config:
[spark.gluten.sql.session.timeZone.default, Asia/Harbin]
[spark.sql.files.ignoreMissingFiles, false]
[spark.sql.legacy.timeParserPolicy, LEGACY]
[spark.gluten.ugi.username, tc_warehouse]
[spark.sql.mapKeyDedupPolicy, EXCEPTION]
[spark.shuffle.file.buffer, 65536]
[spark.sql.caseSensitive, false]
[spark.shuffle.spill.compress, true]
[spark.gluten.memory.task.offHeap.size.in.bytes, 1073741824]
[spark.gluten.memory.backtrace.allocation, false]
[spark.gluten.sql.debug, true]
[spark.gluten.sql.columnarToRowMemoryThreshold, 67108864]
[spark.redaction.regex,
(?i)secret|password|token|access[.]key|zookeeper.auth.digest]
[spark.gluten.sql.columnar.backend.velox.threads, 4]
[spark.gluten.ugi.tokens,
HQoQCgwIvazBARDQoITx_zIQARCB8NvX_v____8BFPok-8GiV982gX1l6eTiThfv65d4EFlBUk5fQU1fUk1fVE9LRU4dMTI3LjAuMC4xOjgwNDksMTI3LjAuMC4xOjgwNDkCHBhc3N3b3JkCEtZeXNkYUYxCHBhc3N3b3JkCHBhc3N3b3Jk]
I20250814 14:12:01.689005 3618619 FileSystems.cpp:212]
LocalFileSystem::mkdir
/data6/data/hadoop/yarn/local/usercache/tc_warehouse/appcache/application_1752315269200_3167805/gluten-da1c9896-f263-4c2d-ac06-184396fc37ba/gluten-spill/19a746ea-97b8-4a53-8817-763627ad1411
I20250814 14:12:07.497653 3618619 VeloxMemoryManager.cc:274]
Shrink[root/root]: Trying to shrink 41943040 bytes of data...
I20250814 14:12:07.497735 3618619 VeloxMemoryManager.cc:275]
Shrink[root/root]: Pool has reserved
726016512/747634688/747634688/9223372036854775807 bytes.
I20250814 14:12:07.497747 3618619 VeloxMemoryManager.cc:277]
Shrink[root/root]: Shrinking...
I20250814 14:12:07.497756 3618619 VeloxMemoryManager.cc:279]
Shrink[root/root]: 0 bytes released from shrinking.
I20250814 14:12:07.497807 3618619 VeloxMemoryManager.cc:274]
Shrink[root/root]: Trying to shrink 41943040 bytes of data...
I20250814 14:12:07.497814 3618619 VeloxMemoryManager.cc:275]
Shrink[root/root]: Pool has reserved 131072/1048576/8388608/9223372036854775807
bytes.
I20250814 14:12:07.497820 3618619 VeloxMemoryManager.cc:277]
Shrink[root/root]: Shrinking...
I20250814 14:12:07.497846 3618619 VeloxMemoryManager.cc:279]
Shrink[root/root]: 7340032 bytes released from shrinking.
I20250814 14:12:07.497867 3618619 VeloxMemoryManager.cc:274]
Shrink[root/root]: Trying to shrink 34603008 bytes of data...
I20250814 14:12:07.497874 3618619 VeloxMemoryManager.cc:275]
Shrink[root/root]: Pool has reserved 0/0/0/9223372036854775807 bytes.
I20250814 14:12:07.497880 3618619 VeloxMemoryManager.cc:277]
Shrink[root/root]: Shrinking...
I20250814 14:12:07.497886 3618619 VeloxMemoryManager.cc:279]
Shrink[root/root]: 0 bytes released from shrinking.
I20250814 14:12:07.497913 3618619 VeloxMemoryManager.cc:274]
Shrink[root/root]: Trying to shrink 34603008 bytes of data...
I20250814 14:12:07.497920 3618619 VeloxMemoryManager.cc:275]
Shrink[root/root]: Pool has reserved
726016512/747634688/747634688/9223372036854775807 bytes.
I20250814 14:12:07.497927 3618619 VeloxMemoryManager.cc:277]
Shrink[root/root]: Shrinking...
I20250814 14:12:07.497934 3618619 VeloxMemoryManager.cc:279]
Shrink[root/root]: 0 bytes released from shrinking.
I20250814 14:12:07.497941 3618619 WholeStageResultIterator.cc:244]
Spill[root/root]: trying to request spill for 33.00MB.
I20250814 14:12:09.362419 3618619 WholeStageResultIterator.cc:248]
Spill[root/root]: successfully reclaimed total 375.00MB with shrunken 0B and
spilled 375.00MB.
25/08/14 14:12:13 INFO SaslDataTransferClient: SASL encryption trust check:
localHostTrusted = false, remoteHostTrusted = false
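
For reference, the memory-manager figures above are raw byte counts, and
they line up with the MB figures in the spill lines; the last value in the
reserved quadruple, 9223372036854775807, is just Long.MAX_VALUE, i.e. an
effectively unbounded pool capacity. A quick sanity check of the
conversions, assuming the log's "MB" means MiB:

    echo $((41943040 / 1048576))     # 40   - first shrink target
    echo $((34603008 / 1048576))     # 33   - matches "request spill for 33.00MB"
    echo $((7340032 / 1048576))      # 7    - bytes actually released by one shrink
    echo $((1073741824 / 1048576))   # 1024 - task off-heap size = 1 GiB

Note that the spill itself completed successfully at 14:12:09 (375.00MB
reclaimed), roughly 50 seconds before the container exit was recorded at
14:13:02, so the abort does not appear to have happened inside the spill
path itself.
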
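
For anyone trying to reproduce or narrow this down: the session config in
the stderr tail shows the relevant knobs (spark.gluten.sql.debug=true is
what produced the Substrait plan JSON above). A minimal sketch of passing
the same settings at submit time follows; the config names are copied from
the dump, but the submit form and the spark.memory.offHeap.* values are
assumptions, not taken from this log, and some entries (such as the
per-task off-heap size) may be derived internally by Gluten rather than
set directly:

    # Sketch only; gluten config names are copied from the session dump above.
    # The spark.memory.offHeap.* lines are an assumption (4g = 1 GiB per
    # task x 4 cores), not something this log states.
    spark-submit \
      --conf spark.gluten.sql.debug=true \
      --conf spark.gluten.sql.columnar.backend.velox.threads=4 \
      --conf spark.gluten.sql.columnarToRowMemoryThreshold=67108864 \
      --conf spark.memory.offHeap.enabled=true \
      --conf spark.memory.offHeap.size=4g \
      ...   # plus the usual master/deploy-mode flags and the application jar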