YuweiXiao opened a new issue, #5740: URL: https://github.com/apache/hudi/issues/5740
**Describe the problem you faced**

Currently, Spark may fail to read Parquet files containing decimal-type values [[link](https://stackoverflow.com/questions/63578928/spark-unable-to-read-decimal-columns-in-parquet-files-written-by-avroparquetwrit)]. A workaround is to set `spark.sql.parquet.enableVectorizedReader=false`. However, on some Spark Hudi read paths we explicitly set this config to true [[commit](https://github.com/apache/hudi/pull/5168)]. Though it might sound hacky, we need to auto-set this config based on the schema, e.g., turn the vectorized reader off if the schema contains a decimal type. We should probably also respect the Spark configs supplied by users, rather than overriding them directly.

**To Reproduce**

Steps to reproduce the behavior:

1. Create a Hudi table with a decimal-type column
2. Use Spark to read the table

**Expected behavior**

**Environment Description**

* Hudi version : master
* Spark version : 2.4.4
* Hive version : -
* Hadoop version : -
* Storage (HDFS/S3/GCS..) : local
* Running on Docker? (yes/no) : no

**Additional context**
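For reference, the workaround mentioned above can be applied at submit time as a plain config flag (this is a standard Spark config, not a Hudi-specific one; note that the read paths touched by the linked commit may still override it):

```shell
# Disable the vectorized Parquet reader for the whole session
spark-shell --conf spark.sql.parquet.enableVectorizedReader=false
```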
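A minimal sketch of the proposed auto-configuration, with the schema modeled as simple `(name, type)` pairs for illustration (in Hudi this decision would be made against the resolved table schema instead; `contains_decimal` and `vectorized_reader_enabled` are hypothetical helper names, not existing Hudi APIs):

```python
# Sketch: disable the vectorized Parquet reader only when the schema
# contains a decimal column, while respecting an explicit user setting.

def contains_decimal(data_type) -> bool:
    """Return True if the (possibly nested) type tree contains a decimal.

    Leaf types are strings like "decimal(10,2)"; nested struct/array/map
    types are modeled as lists of child types.
    """
    if isinstance(data_type, str):
        return data_type.startswith("decimal")
    return any(contains_decimal(child) for child in data_type)

def vectorized_reader_enabled(schema, user_setting=None) -> bool:
    """Honor a user-supplied setting; otherwise auto-disable for decimals."""
    if user_setting is not None:
        return user_setting
    return not any(contains_decimal(field_type) for _, field_type in schema)

schema = [("id", "bigint"), ("price", "decimal(10,2)"), ("tags", ["string"])]
print(vectorized_reader_enabled(schema))        # decimal present -> False
print(vectorized_reader_enabled(schema, True))  # explicit user config wins -> True
```

The key design point matches the issue: the table schema only drives the default, and an explicitly set user config is never silently overridden.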
**Stacktrace**

```
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 5, localhost, executor driver): org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:250)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:497)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:220)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at org.apache.hudi.HoodieMergeOnReadRDD$RecordMergingFileIterator.hasNextInternal(HoodieMergeOnReadRDD.scala:273)
  at org.apache.hudi.HoodieMergeOnReadRDD$RecordMergingFileIterator.hasNext(HoodieMergeOnReadRDD.scala:267)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
  at org.apache.spark.scheduler.Task.run(Task.scala:123)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
