boneanxs opened a new issue, #11188:
URL: https://github.com/apache/hudi/issues/11188

   We recently hit the same error as https://github.com/apache/hudi/issues/9305, 
but with a different cause (we are still on Hudi 0.12).
   
   The user manually set `spark.sql.parquet.enableVectorizedReader` to 
false, then read a Hive table and cached it. Because Spark analyzes and plans a 
query at the moment it is marked for caching, it does not add `C2R` 
(ColumnarToRow) to the cached plan, since the vectorized reader is disabled at 
that point. Spark does not execute the plan yet, because no action has been 
called.
   
   The user then reads a MOR read_optimized table, joins it with the cached 
plan, and fetches the result. Because the MOR table automatically sets 
`enableVectorizedReader` to true, the Hive table is actually read as column 
batches, but its plan contains no `C2R` to convert the batches to rows, so 
the following error occurs:
   
   ![Screenshot 2024-05-10 at 18 32 22](https://github.com/apache/hudi/assets/10115332/14b387e0-ecee-4c04-9aff-ba024ce3af55)
   
   ```java
   java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to org.apache.spark.sql.catalyst.InternalRow
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
        at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.hasNext(InMemoryRelation.scala:118)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
        at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:302)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1481)
        at 
   ```
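   
   The sequence described above can be sketched roughly as follows. This is a 
hedged reproduction outline, not a verified test case: the table name 
`some_db.some_hive_table`, the path `/tmp/hudi/mor_table`, and the join column 
`id` are hypothetical placeholders, and it assumes a Hive-enabled Spark session 
with the Hudi bundle on the classpath.
   
   ```scala
   import org.apache.spark.sql.SparkSession

   object VectorizedReaderRepro {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("hudi-11188-repro-sketch")
         .enableHiveSupport()
         .getOrCreate()

       // 1. The user explicitly disables the vectorized Parquet reader.
       spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

       // 2. cache() is lazy: the plan is prepared now, but nothing runs yet.
       //    (hypothetical table name)
       val cached = spark.table("some_db.some_hive_table").cache()

       // 3. Read the MOR table as read_optimized. Per the report, Hudi 0.12
       //    sets the session config to "true" along this path.
       //    (hypothetical path)
       val mor = spark.read.format("hudi")
         .option("hoodie.datasource.query.type", "read_optimized")
         .load("/tmp/hudi/mor_table")

       // 4. The first action materializes the cached plan under the changed
       //    config. (hypothetical join column)
       mor.join(cached, Seq("id")).count()

       spark.stop()
     }
   }
   ```
   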
   ```scala
     override def imbueConfigs(sqlContext: SQLContext): Unit = {
       super.imbueConfigs(sqlContext)
       // Session-wide side effect: overwrites whatever value the user configured.
       sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", "true")
     }
   ```
   
   I see that the master branch has changed this code, but I suspect the issue 
can still occur, since the same config is also set in 
`HoodieFileGroupReaderBasedParquetFileFormat`:
   
   ```scala
   spark.conf.set("spark.sql.parquet.enableVectorizedReader", supportBatchResult)
   ```
   
   Beyond this issue, is it appropriate to set Spark configs globally? Whether 
or not users set them, Hudi sets many Spark-related configs in `SparkConf`, 
most of them for the Parquet reader/writer. This can confuse users and make 
root causes hard for developers to find.
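   
   As one possible alternative to a global override, a config change could be 
scoped and then restored. The sketch below is only an illustration under 
stated assumptions: `ScopedConf`, `Conf`, `MapConf`, and `withConf` are 
hypothetical names, not Hudi or Spark APIs, and `MapConf` is an in-memory 
stand-in so the example runs without a Spark session.
   
   ```scala
   import scala.collection.mutable

   // Hypothetical helper for illustration; not part of Hudi or Spark.
   object ScopedConf {
     // Minimal view of a mutable string config. An adapter could delegate
     // these three calls to spark.conf (getOption, set, unset).
     trait Conf {
       def getOption(key: String): Option[String]
       def set(key: String, value: String): Unit
       def unset(key: String): Unit
     }

     // In-memory stand-in for a session config.
     final class MapConf(initial: (String, String)*) extends Conf {
       private val values = mutable.Map(initial: _*)
       def getOption(key: String): Option[String] = values.get(key)
       def set(key: String, value: String): Unit = values.update(key, value)
       def unset(key: String): Unit = { values.remove(key); () }
     }

     // Runs `body` with `key` set to `value`, then restores the previous
     // state, including the case where the key was not set at all.
     def withConf[T](conf: Conf, key: String, value: String)(body: => T): T = {
       val previous = conf.getOption(key)
       conf.set(key, value)
       try body
       finally {
         previous match {
           case Some(old) => conf.set(key, old)
           case None      => conf.unset(key)
         }
       }
     }
   }

   object ScopedConfDemo extends App {
     import ScopedConf._
     val key  = "spark.sql.parquet.enableVectorizedReader"
     val conf = new MapConf(key -> "false") // the user's explicit setting
     val seenInside = withConf(conf, key, "true") { conf.getOption(key) }
     println(seenInside)          // value visible inside the scope
     println(conf.getOption(key)) // the user's value is back after the scope
   }
   ```
   
   Restoring the value limits how far the override leaks, but it does not by 
itself help lazily executed plans such as the cached one above, which see 
whatever value is active when they finally run.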

