my7ym opened a new issue, #6439: URL: https://github.com/apache/incubator-gluten/issues/6439
### Backend
VL (Velox)

### Bug description
Trying to set up Gluten + Spark 3.4.2 on AWS EMR.

- Spark: open-source Spark with the default Spark configs (not the EMR defaults), downloaded with `wget https://archive.apache.org/dist/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz`
- Gluten: downloaded from https://github.com/apache/incubator-gluten/releases/download/v1.1.1/gluten-velox-bundle-spark3.4_2.12-1.1.1.jar and also copied into `$SPARK_HOME/jars` to play it safe
- Hadoop: Hadoop 3.3.0 on EMR 6.15.0 with the default EMR configs

Command:
```
$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client \
  --conf spark.plugins=io.glutenproject.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.executorEnv.LIBHDFS3_CONF=/etc/hadoop/conf/hdfs-client.xml \
  --conf spark.executor.extraClassPath=/home/hadoop/gluten-velox-bundle-spark3.4_2.12-1.1.1.jar
```
```
scala> spark.read.parquet("hdfs:///test_data").show()
```
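For completeness, a quick way to confirm the plugin is active at all (a hypothetical check, not something from my original run): `explain()` only plans the query, so it should print a Gluten transformer node in place of the usual `FileScan` without triggering the executor-side read.

```
# Hypothetical sanity check: pipe a one-liner into spark-shell (same confs
# as above) and look for a Gluten/Velox scan node in the physical plan.
echo 'spark.read.parquet("hdfs:///test_data").explain()' | \
  $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client \
    --conf spark.plugins=io.glutenproject.GlutenPlugin \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=20g
```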
[Expected behavior]
Spark can read the data from HDFS.

[Actual behavior]
The statement fails with the following error message:

```
24/07/13 20:33:40 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 9) (ip-blahbla.internal executor 2): io.glutenproject.exception.GlutenException: java.lang.RuntimeException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Read failure in HDFSReadFile::preadInternal.
Retriable: False
Expression: bytesRead >= 0
Context: Split [Hive: hdfs://ip-blahbla.internal:8020/test_data/part-00000-9ca88cd6-6afc-4c05-97d8-6f21bba40d84-c000.snappy.parquet 0 - 691] Task Gluten_Stage_6_TID_9
Top-Level Context: Same as context.
Function: read
File: /root/src/oap-project/gluten/ep/build-velox/build/velox_ep/velox/connectors/hive/storage_adapters/hdfs/HdfsReadFile.h
Line: 53
Stack trace:
# 0 _ZN8facebook5velox7process10StackTraceC1Ei
# 1 _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorEPKcEEvRKNS1_18VeloxCheckFailArgsET0_
# 3 _ZNK8facebook5velox12HdfsReadFile13preadInternalEmmPc
# 4 _ZNK8facebook5velox12HdfsReadFile5preadEmmPv
# 5 _ZNK8facebook5velox8ReadFile6preadvEmRKSt6vectorIN5folly5RangeIPcEESaIS6_EE
# 6 _ZN8facebook5velox4dwio6common19ReadFileInputStream4readERKSt6vectorIN5folly5RangeIPcEESaIS8_EEmNS2_10MetricsLog11MetricsTypeE
# 7 _ZN8facebook5velox4dwio6common19DirectCoalescedLoad8loadDataEb
# 8 _ZN8facebook5velox5cache13CoalescedLoad12loadOrFutureEPN5folly10SemiFutureIbEE
# 9 _ZN8facebook5velox4dwio6common17DirectInputStream12loadPositionEv
# 10 _ZN8facebook5velox4dwio6common17DirectInputStream4NextEPPKvPi
# 11 _ZN8facebook5velox7parquet10ReaderBase16loadFileMetaDataEv
# 12 _ZN8facebook5velox7parquet10ReaderBaseC1ESt10unique_ptrINS0_4dwio6common13BufferedInputESt14default_deleteIS6_EERKNS5_13ReaderOptionsE
# 13 _ZN8facebook5velox7parquet13ParquetReaderC2ESt10unique_ptrINS0_4dwio6common13BufferedInputESt14default_deleteIS6_EERKNS5_13ReaderOptionsE
# 14 _ZN8facebook5velox7parquet20ParquetReaderFactory12createReaderESt10unique_ptrINS0_4dwio6common13BufferedInputESt14default_deleteIS6_EERKNS5_13ReaderOptionsE
# 15 _ZN8facebook5velox9connector4hive11SplitReader12prepareSplitESt10shared_ptrINS0_6common14MetadataFilterEERNS0_4dwio6common17RuntimeStatisticsE
# 16 _ZN8facebook5velox9connector4hive14HiveDataSource8addSplitESt10shared_ptrINS1_14ConnectorSplitEE
# 17 _ZN8facebook5velox4exec9TableScan9getOutputEv
# 18 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
# 19 _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
# 20 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 21 _ZN6gluten24WholeStageResultIterator4nextEv
# 22 Java_io_glutenproject_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 23 0x00007ff7c9018427
	at io.glutenproject.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:39)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
	at io.glutenproject.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
	at io.glutenproject.utils.IteratorCompleter.hasNext(Iterators.scala:69)
	at io.glutenproject.utils.PayloadCloser.hasNext(Iterators.scala:35)
	at io.glutenproject.utils.PipelineTimeAccumulator.hasNext(Iterators.scala:98)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator.isEmpty(Iterator.scala:387)
	at scala.collection.Iterator.isEmpty$(Iterator.scala:387)
	at org.apache.spark.InterruptibleIterator.isEmpty(InterruptibleIterator.scala:28)
	at io.glutenproject.execution.VeloxColumnarToRowExec$.toRowIterator(VeloxColumnarToRowExec.scala:116)
	at io.glutenproject.execution.VeloxColumnarToRowExec.$anonfun$doExecuteInternal$1(VeloxColumnarToRowExec.scala:80)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:853)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:853)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```

Sometimes, after this statement, some executors also crash, with the following error message in the container log:

```
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f52b3ce8dbc, pid=24849, tid=0x00007f52bba0f700
#
# JRE version: OpenJDK Runtime Environment (8.0_412-b08) (build 1.8.0_412-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.412-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libvelox.so+0x542fdbc]  hdfsBuilderSetNameNode+0x2c
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt1/yarn/usercache/hadoop/appcache/application_1720835486566_0009/container_1720835486566_0009_01_000003/hs_err_pid24849.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-8/issues/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
```

I suspected the crash was caused by libhdfs3 not being found, so I installed it with conda on the executor instances and tried to link it with

```
--conf spark.executor.extraLibraryPath=/root/anaconda2/pkgs/libhdfs3-2.3-0/lib/libhdfs3.so.2.2.31
```

but it did not help.
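As far as I understand, `spark.executor.extraLibraryPath` is added to the executor's `LD_LIBRARY_PATH`, so it expects a directory rather than a single `.so` file, and the dynamic loader then resolves the library by its soname. A sketch of what I believe the equivalent setup would look like (the symlink name is an assumption on my part):

```
# Assumption: extraLibraryPath should name a directory on the loader path;
# a soname symlink lets the loader resolve libhdfs3 from that directory.
LIBDIR=/root/anaconda2/pkgs/libhdfs3-2.3-0/lib
ln -sf "$LIBDIR/libhdfs3.so.2.2.31" "$LIBDIR/libhdfs3.so"

$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client \
  --conf spark.plugins=io.glutenproject.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g \
  --conf spark.executor.extraLibraryPath="$LIBDIR"
```

That said, the problematic frame above resolves `hdfsBuilderSetNameNode` inside `libvelox.so` itself, which suggests the HDFS client is linked into the bundle, so an external libhdfs3 might not be picked up at all.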
I also tried to read & write data with S3, and that failed with:

```
FileSink is not registered for s3://blahblah/_temporary/0/_temporary/attempt_202407111748436837404674407678922_0000_m_000000_5/
```
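On the S3 failure: my understanding (an assumption on my part, not verified) is that each Velox file system has to be compiled into the backend, and the prebuilt bundle may not include S3 support, in which case building the bundle from source with S3 enabled would be needed, along these lines:

```
# Assumption: the prebuilt 1.1.1 jar may lack S3 support; these flags are
# the build-script options for enabling the HDFS and S3 file systems.
git clone https://github.com/apache/incubator-gluten.git
cd incubator-gluten
./dev/buildbundle-veloxbe.sh --enable_hdfs=ON --enable_s3=ON
```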
Thanks for helping!

### Spark version
Spark-3.4.x

### Spark configurations
Default open-source Spark 3.4.2 config.

### System information
_No response_

### Relevant logs
_No response_