YuweiXiao opened a new issue, #5740: URL: https://github.com/apache/hudi/issues/5740
**Describe the problem you faced**

Currently, Spark may fail to read Parquet files containing decimal-type values [[link](https://stackoverflow.com/questions/63578928/spark-unable-to-read-decimal-columns-in-parquet-files-written-by-avroparquetwrit)]. A workaround is to set `spark.sql.parquet.enableVectorizedReader=false`. However, on some Spark Hudi read paths we explicitly set this config to true [[commit](https://github.com/apache/hudi/pull/5168)]. Though it might sound hacky, we need to auto-set this config based on the schema, e.g., turn the vectorized reader off if the schema contains a decimal type. We should probably also respect the Spark configs supplied by users, rather than overriding them directly.

**To Reproduce**

Steps to reproduce the behavior:

1. Create a Hudi table with a decimal-type column
2. Use Spark to read the table

**Expected behavior**

**Environment Description**

* Hudi version : master
* Spark version : 2.4.4
* Hive version : -
* Hadoop version : -
* Storage (HDFS/S3/GCS..) : local
* Running on Docker? (yes/no) : no

**Additional context**
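For reference, the workaround mentioned above can be applied at submit time as a plain config flag (this is a standard Spark config, not a Hudi-specific one; note that the read paths touched by the linked commit may still override it):

```shell
# Disable the vectorized Parquet reader for the whole session
spark-shell --conf spark.sql.parquet.enableVectorizedReader=false
```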
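A minimal sketch of the proposed auto-configuration, with the schema modeled as simple `(name, type)` pairs for illustration (in Hudi this decision would be made against the resolved table schema instead; `contains_decimal` and `vectorized_reader_enabled` are hypothetical helper names, not existing Hudi APIs):

```python
# Sketch: disable the vectorized Parquet reader only when the schema
# contains a decimal column, while respecting an explicit user setting.

def contains_decimal(data_type) -> bool:
    """Return True if the (possibly nested) type tree contains a decimal.

    Leaf types are strings like "decimal(10,2)"; nested struct/array/map
    types are modeled as lists of child types.
    """
    if isinstance(data_type, str):
        return data_type.startswith("decimal")
    return any(contains_decimal(child) for child in data_type)

def vectorized_reader_enabled(schema, user_setting=None) -> bool:
    """Honor a user-supplied setting; otherwise auto-disable for decimals."""
    if user_setting is not None:
        return user_setting
    return not any(contains_decimal(field_type) for _, field_type in schema)

schema = [("id", "bigint"), ("price", "decimal(10,2)"), ("tags", ["string"])]
print(vectorized_reader_enabled(schema))        # decimal present -> False
print(vectorized_reader_enabled(schema, True))  # explicit user config wins -> True
```

The key design point matches the issue: the table schema only drives the default, and an explicitly set user config is never silently overridden.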
**Stacktrace**

```
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 5, localhost, executor driver): org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:250)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:497)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:220)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at org.apache.hudi.HoodieMergeOnReadRDD$RecordMergingFileIterator.hasNextInternal(HoodieMergeOnReadRDD.scala:273)
  at org.apache.hudi.HoodieMergeOnReadRDD$RecordMergingFileIterator.hasNext(HoodieMergeOnReadRDD.scala:267)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
  at org.apache.spark.scheduler.Task.run(Task.scala:123)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
