Tomccat3 opened a new issue, #8136:
URL: https://github.com/apache/incubator-gluten/issues/8136
### Backend
VL (Velox)
### Bug description
1. Build Gluten from the main branch.
2. Run run_tpcds.sh.
3. The run fails at Q1.
4. It looks like an HDFS read error, but the file does exist in HDFS (see the verification sketch below).
5. The same workload runs successfully without Gluten.
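
To double-check point 4, here is a minimal sketch of how the file's existence was verified from the same spark-shell session. It assumes the cluster's default filesystem is the HDFS namenode shown in the split context of the error; the part-file path is copied from the error message below.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration already attached to the running Spark session.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Part-file path copied from the Velox error message below.
val p = new Path("/tmp/tpch/date_dim/part-00000-ef370f7e-fc73-4d45-a1e0-222639c968c6-c000.snappy.parquet")

if (fs.exists(p)) {
  // The JVM-side Hadoop client can see the file and report its size.
  println(s"exists, length=${fs.getFileStatus(p).getLen}")
} else {
  println("missing from the default filesystem")
}
```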
### Spark version
Spark-3.5.x
### Spark configurations
GLUTEN_JAR=/usr/lib/spark/jars/gluten-velox-bundle-spark3.5_2.12-amzn_2_x86_64-1.3.0-SNAPSHOT.jar
SPARK_HOME=/usr/lib/spark
cat tpcds_parquet.scala | ${SPARK_HOME}/bin/spark-shell \
--master yarn --deploy-mode client \
--conf spark.plugins=org.apache.gluten.GlutenPlugin \
--conf spark.driver.extraClassPath=${GLUTEN_JAR} \
--conf spark.executor.extraClassPath=${GLUTEN_JAR} \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=18g \
--conf spark.gluten.sql.columnar.forceShuffledHashJoin=true \
--conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
--num-executors 2 \
--executor-cores 1 \
--driver-memory 4g \
--executor-memory 3g \
--conf spark.executor.memoryOverhead=2g \
--conf spark.driver.maxResultSize=2g \
--conf spark.executorEnv.JAVA_HOME="/usr/lib/jvm/java-1.8.0" \
--conf spark.yarn.appMasterEnv.JAVA_HOME="/usr/lib/jvm/java-1.8.0" \
--conf spark.gluten.sql.columnar.backend.velox.IOThreads=0
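
For reference, a minimal stand-in for the kind of scan that fails (tpcds_parquet.scala itself is not included in this report, so this is only an assumed reproduction sketch, not the actual script). The table path and namenode address are taken from the split context in the error below.

```scala
// Hypothetical minimal repro: scan the date_dim Parquet table directly from
// the namenode reported in the failing split and force a full read.
val dateDim = spark.read.parquet("hdfs://10.0.75.184:8020/tmp/tpch/date_dim")
println(dateDim.count())
```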
### System information
Gluten Version: 1.3.0-SNAPSHOT
Commit: 95a7c95bdf749a047f5cf9bb5ba5e2552d29a6cd
CMake Version: 3.22.2
System: Linux-6.1.112-124.190.amzn2023.x86_64
Arch: x86_64
CPU Name: Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
BIOS Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.4.1
C Compiler: /usr/bin/cc
C Compiler Version: 11.4.1
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt
### Relevant logs
```bash
24/12/03 11:43:39 INFO Executor: Executor is trying to kill task 0.0 in stage 34.0 (TID 12059), reason: Stage cancelled: Job 34 cancelled as part of cancellation of all jobs
hdfsGetPathInfo(/tmp/tpch/date_dim/part-00000-ef370f7e-fc73-4d45-a1e0-222639c968c6-c000.snappy.parquet): getFileInfo error: InterruptedIOException: Call interruptedjava.io.InterruptedIOException: Call interrupted
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1669)
    at org.apache.hadoop.ipc.Client.call(Client.java:1621)
    at org.apache.hadoop.ipc.Client.call(Client.java:1518)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
    at com.sun.proxy.$Proxy51.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$getFileInfo$42(ClientNamenodeProtocolTranslatorPB.java:831)
    at org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:831)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:437)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:170)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:162)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:100)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:366)
    at com.sun.proxy.$Proxy52.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1770)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1835)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1832)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1847)
    at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
    at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNext0(ColumnarBatchOutIterator.java:57)
    at org.apache.gluten.iterator.ClosableIterator.hasNext(ClosableIterator.java:39)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
    at org.apache.gluten.iterator.IteratorsV1$InvocationFlowProtection.hasNext(IteratorsV1.scala:159)
    at org.apache.gluten.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
    at org.apache.gluten.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
    at org.apache.gluten.iterator.IteratorsV1$LifeTimeAccumulator.hasNext(IteratorsV1.scala:100)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
    at scala.collection.AbstractIterator.to(Iterator.scala:1431)
    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
    at org.apache.spark.sql.execution.BroadcastUtils$.serializeStream(BroadcastUtils.scala:151)
    at org.apache.gluten.backendsapi.velox.VeloxSparkPlanExecApi.$anonfun$createBroadcastRelation$1(VeloxSparkPlanExecApi.scala:625)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:859)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:859)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:61)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:368)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:332)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:61)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:368)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:332)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:174)
    at org.apache.spark.scheduler.Task.run(Task.scala:152)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:632)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:96)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:635)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
E20241203 11:43:39.798856 951225 Exceptions.h:66] Line: /opt/gluten/ep/build-velox/build/velox_ep/velox/connectors/hive/storage_adapters/hdfs/HdfsReadFile.cpp:36, Function:HdfsReadFile, Expression: Unable to get file path info for file: /tmp/tpch/date_dim/part-00000-ef370f7e-fc73-4d45-a1e0-222639c968c6-c000.snappy.parquet. got error: FileNotFoundException: Path /tmp/tpch/date_dim/part-00000-ef370f7e-fc73-4d45-a1e0-222639c968c6-c000.snappy.parquet does not exist., Source: RUNTIME, ErrorCode: FILE_NOT_FOUND
24/12/03 11:43:39 ERROR TaskResources: Task 12059 failed by error:
org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: FILE_NOT_FOUND
Reason: Unable to get file path info for file: /tmp/tpch/date_dim/part-00000-ef370f7e-fc73-4d45-a1e0-222639c968c6-c000.snappy.parquet. got error: FileNotFoundException: Path /tmp/tpch/date_dim/part-00000-ef370f7e-fc73-4d45-a1e0-222639c968c6-c000.snappy.parquet does not exist.
Retriable: False
Context: Split [Hive: hdfs://10.0.75.184:8020/tmp/tpch/date_dim/part-00000-ef370f7e-fc73-4d45-a1e0-222639c968c6-c000.snappy.parquet 0 - 1868849] Task Gluten_Stage_34_TID_12059_VTID_2
Additional Context: Operator: TableScan[0] 0
Function: HdfsReadFile
File: /opt/gluten/ep/build-velox/build/velox_ep/velox/connectors/hive/storage_adapters/hdfs/HdfsReadFile.cpp
Line: 36
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKSsEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox12HdfsReadFileC2EPNS0_11filesystems5arrow2io8internal11LibHdfsShimEP13hdfs_internalSt17basic_string_viewIcSt11char_traitsIcEE
# 4  _ZN8facebook5velox11filesystems14HdfsFileSystem15openFileForReadESt17basic_string_viewIcSt11char_traitsIcEERKNS1_11FileOptionsE
# 5  _ZN8facebook5velox19FileHandleGeneratorclERKSsPKNS0_14FilePropertiesE
# 6  _ZN8facebook5velox13CachedFactoryISsNS0_10FileHandleENS0_19FileHandleGeneratorENS0_14FilePropertiesENS0_15FileHandleSizerESt8equal_toISsESt4hashISsEE8generateERKSsPKS4_
# 7  _ZN8facebook5velox9connector4hive11SplitReader12createReaderESt10shared_ptrINS0_6common14MetadataFilterEE
# 8  _ZN8facebook5velox9connector4hive11SplitReader12prepareSplitESt10shared_ptrINS0_6common14MetadataFilterEERNS0_4dwio6common17RuntimeStatisticsE
# 9  _ZN8facebook5velox9connector4hive14HiveDataSource8addSplitESt10shared_ptrINS1_14ConnectorSplitEE
# 10 _ZN8facebook5velox4exec9TableScan9getOutputEv
# 11 _ZZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEEENKUlvE3_clEv
# 12 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
# 13 _ZN8facebook5velox4exec6Driver4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 14 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 15 _ZN6gluten24WholeStageResultIterator4nextEv
# 16 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 17 0x00007f383d018427
```