rahil-c commented on PR #8682: URL: https://github.com/apache/hudi/pull/8682#issuecomment-1558191054
Hi @danny0405 @xiarixiaoyao, we are trying to upgrade Spark to 3.4.0 in Hudi, but we are hitting several functional test failures caused by another casting exception. For example, running `TestAvroSchemaResolutionSupport#testArrayOfMapsChangeValueType` fails with `java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.UnsafeRow cannot be cast to org.apache.spark.sql.vectorized.ColumnarBatch`:

```
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.UnsafeRow cannot be cast to org.apache.spark.sql.vectorized.ColumnarBatch
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:600)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:589)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
```

We can get past this by disabling both the vectorized reader and whole-stage code generation, but I do not think these are acceptable workarounds. We would appreciate your thoughts, and we would be happy to sync offline at some point to share our findings as well.
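For reference, the workarounds mentioned above (which we do not consider acceptable fixes) correspond to the following standard Spark SQL configs; this is a sketch assuming a live `SparkSession` named `spark` and Parquet base files:

```scala
// Workarounds only, not a fix:
// 1. Disable the Parquet vectorized reader so the file scan emits
//    InternalRow instead of ColumnarBatch.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
// 2. Disable whole-stage codegen so the generated ColumnarToRow
//    iterator (where the cast happens) is not exercised.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
```

With either of these set, the cast in `FileSourceScanExec`'s generated `columnartorow_nextBatch` path is avoided, which is why the tests pass, but it sacrifices read performance across the board.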
