my7ym opened a new issue, #6439:
URL: https://github.com/apache/incubator-gluten/issues/6439

   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   Trying to set up Gluten + Spark 3.4.2 on AWS EMR.
   Spark:
   Open source Spark with the default Spark configs (not the EMR defaults), downloaded with `wget https://archive.apache.org/dist/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz`
   
   Gluten:
   Downloaded from https://github.com/apache/incubator-gluten/releases/download/v1.1.1/gluten-velox-bundle-spark3.4_2.12-1.1.1.jar and also added to $SPARK_HOME/jars to be safe.
   
   Hadoop:
   Hadoop 3.3.0 on EMR 6.15.0 with configs default on EMR.
   
   Command:
   ```
   $SPARK_HOME/bin/spark-shell  --master yarn --deploy-mode client  --conf 
spark.plugins=io.glutenproject.GlutenPlugin  --conf 
spark.memory.offHeap.enabled=true  --conf spark.memory.offHeap.size=20g  --conf 
spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager 
--conf spark.executorEnv.LIBHDFS3_CONF=/etc/hadoop/conf/hdfs-client.xml --conf 
spark.executor.extraClassPath=/home/hadoop/gluten-velox-bundle-spark3.4_2.12-1.1.1.jar
   ```
   
   ```
   scala> spark.read.parquet("hdfs:///test_data").show()
   ```
   
   [Expected behavior]
   Spark can read data from HDFS.
   
   [Actual behavior]
   The statement failed with the following error message:
   ```
   24/07/13 20:33:40 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 9) 
(ip-blahbla.internal executor 2): io.glutenproject.exception.GlutenException: 
java.lang.RuntimeException: Exception: VeloxRuntimeError
   Error Source: RUNTIME
   Error Code: INVALID_STATE
   Reason: Read failure in HDFSReadFile::preadInternal.
   Retriable: False
   Expression: bytesRead >= 0
   Context: Split [Hive: 
hdfs://ip-blahbla.internal:8020/test_data/part-00000-9ca88cd6-6afc-4c05-97d8-6f21bba40d84-c000.snappy.parquet
 0 - 691] Task Gluten_Stage_6_TID_9
   Top-Level Context: Same as context.
   Function: read
   File: 
/root/src/oap-project/gluten/ep/build-velox/build/velox_ep/velox/connectors/hive/storage_adapters/hdfs/HdfsReadFile.h
   Line: 53
   Stack trace:
   # 0  _ZN8facebook5velox7process10StackTraceC1Ei
   # 1  
_ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
   # 2  
_ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorEPKcEEvRKNS1_18VeloxCheckFailArgsET0_
   # 3  _ZNK8facebook5velox12HdfsReadFile13preadInternalEmmPc
   # 4  _ZNK8facebook5velox12HdfsReadFile5preadEmmPv
   # 5  
_ZNK8facebook5velox8ReadFile6preadvEmRKSt6vectorIN5folly5RangeIPcEESaIS6_EE
   # 6  
_ZN8facebook5velox4dwio6common19ReadFileInputStream4readERKSt6vectorIN5folly5RangeIPcEESaIS8_EEmNS2_10MetricsLog11MetricsTypeE
   # 7  _ZN8facebook5velox4dwio6common19DirectCoalescedLoad8loadDataEb
   # 8  
_ZN8facebook5velox5cache13CoalescedLoad12loadOrFutureEPN5folly10SemiFutureIbEE
   # 9  _ZN8facebook5velox4dwio6common17DirectInputStream12loadPositionEv
   # 10 _ZN8facebook5velox4dwio6common17DirectInputStream4NextEPPKvPi
   # 11 _ZN8facebook5velox7parquet10ReaderBase16loadFileMetaDataEv
   # 12 
_ZN8facebook5velox7parquet10ReaderBaseC1ESt10unique_ptrINS0_4dwio6common13BufferedInputESt14default_deleteIS6_EERKNS5_13ReaderOptionsE
   # 13 
_ZN8facebook5velox7parquet13ParquetReaderC2ESt10unique_ptrINS0_4dwio6common13BufferedInputESt14default_deleteIS6_EERKNS5_13ReaderOptionsE
   # 14 
_ZN8facebook5velox7parquet20ParquetReaderFactory12createReaderESt10unique_ptrINS0_4dwio6common13BufferedInputESt14default_deleteIS6_EERKNS5_13ReaderOptionsE
   # 15 
_ZN8facebook5velox9connector4hive11SplitReader12prepareSplitESt10shared_ptrINS0_6common14MetadataFilterEERNS0_4dwio6common17RuntimeStatisticsE
   # 16 
_ZN8facebook5velox9connector4hive14HiveDataSource8addSplitESt10shared_ptrINS1_14ConnectorSplitEE
   # 17 _ZN8facebook5velox4exec9TableScan9getOutputEv
   # 18 
_ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
   # 19 
_ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
   # 20 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
   # 21 _ZN6gluten24WholeStageResultIterator4nextEv
   # 22 Java_io_glutenproject_vectorized_ColumnarBatchOutIterator_nativeHasNext
   # 23 0x00007ff7c9018427
   
        at 
io.glutenproject.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:39)
        at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
        at 
io.glutenproject.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
        at io.glutenproject.utils.IteratorCompleter.hasNext(Iterators.scala:69)
        at io.glutenproject.utils.PayloadCloser.hasNext(Iterators.scala:35)
        at 
io.glutenproject.utils.PipelineTimeAccumulator.hasNext(Iterators.scala:98)
        at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator.isEmpty(Iterator.scala:387)
        at scala.collection.Iterator.isEmpty$(Iterator.scala:387)
        at 
org.apache.spark.InterruptibleIterator.isEmpty(InterruptibleIterator.scala:28)
        at 
io.glutenproject.execution.VeloxColumnarToRowExec$.toRowIterator(VeloxColumnarToRowExec.scala:116)
        at 
io.glutenproject.execution.VeloxColumnarToRowExec.$anonfun$doExecuteInternal$1(VeloxColumnarToRowExec.scala:80)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:853)
        at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:853)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   ```
   
   Sometimes, after running this statement, some executors also crash with the following error message in the container log:
   ```
   #
   # A fatal error has been detected by the Java Runtime Environment:
   #
   #  SIGSEGV (0xb) at pc=0x00007f52b3ce8dbc, pid=24849, tid=0x00007f52bba0f700
   #
   # JRE version: OpenJDK Runtime Environment (8.0_412-b08) (build 
1.8.0_412-b08)
   # Java VM: OpenJDK 64-Bit Server VM (25.412-b08 mixed mode linux-amd64 
compressed oops)
   # Problematic frame:
   # C  [libvelox.so+0x542fdbc]  hdfsBuilderSetNameNode+0x2c
   #
   # Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
   #
   # An error report file with more information is saved as:
   # 
/mnt1/yarn/usercache/hadoop/appcache/application_1720835486566_0009/container_1720835486566_0009_01_000003/hs_err_pid24849.log
   #
   # If you would like to submit a bug report, please visit:
   #   https://github.com/corretto/corretto-8/issues/
   # The crash happened outside the Java Virtual Machine in native code.
   # See problematic frame for where to report the bug.
   ```
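   
   The problematic frame lands in `libvelox.so` at `hdfsBuilderSetNameNode`, i.e. the native HDFS client code inside the Gluten bundle. A generic way to check whether a native library's runtime dependencies actually resolve is `ldd`; a minimal sketch (the `check_unresolved` helper is illustrative, and `/bin/sh` is only a stand-in target):
   
   ```shell
   #!/bin/sh
   # Count the unresolved runtime dependencies of a shared object or binary.
   # ldd prints "not found" for libraries the dynamic linker cannot locate;
   # "|| true" keeps the exit status clean when the count is 0.
   check_unresolved() {
     ldd "$1" 2>/dev/null | grep -c "not found" || true
   }
   
   # /bin/sh is only a stand-in; in practice point this at the libvelox.so
   # extracted from the Gluten bundle jar, or at the libhdfs3.so in use.
   check_unresolved /bin/sh
   ```
   
   Running it on the executor hosts against the actual library would show whether any dependency of the bundled HDFS client is missing from the search path.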
   
   I suspected the cause was that libhdfs3 could not be found, so I installed it with conda on the executor instance and tried to link it with
   ```
   --conf 
spark.executor.extraLibraryPath=/root/anaconda2/pkgs/libhdfs3-2.3-0/lib/libhdfs3.so.2.2.31
   ```
   but it did not help.
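   
   One detail worth noting: `spark.executor.extraLibraryPath` is a list of directories prepended to the executor's native library search path, so it should point at the library's directory rather than at a single `.so` file. A hedged sketch using the paths from this report (whether the conda build of libhdfs3 is compatible with the bundled libvelox.so is an assumption):
   
   ```
   --conf spark.executor.extraLibraryPath=/root/anaconda2/pkgs/libhdfs3-2.3-0/lib \
   --conf spark.executorEnv.LIBHDFS3_CONF=/etc/hadoop/conf/hdfs-client.xml
   ```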
   
   I also tried to read and write data with S3, and it failed with:
   ```
   FileSink is not registered for 
s3://blahblah/_temporary/0/_temporary/attempt_202407111748436837404674407678922_0000_m_000000_5/
   ```
   
   Thanks for helping!
   
   ### Spark version
   
   Spark-3.4.x
   
   ### Spark configurations
   
   Default open source Spark 3.4.2 config.
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   _No response_

