[
https://issues.apache.org/jira/browse/HUDI-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772413#comment-17772413
]
Lin Liu commented on HUDI-6786:
-------------------------------
Reconstructed the integration PR and ran some tests, which report conversion
failures from InternalRow to ColumnarBatch. I think the abstraction layer works
in rough form; next I need to clean up the logic inside the
HoodieFileGroupReader. The error message is as follows:
{code:java}
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.vectorized.ColumnarBatch
    at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:560)
    at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:549)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
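For context, here is a minimal sketch of how this kind of failure arises on the JVM. It assumes (not confirmed in this ticket) that the scan treats the reader's iterator as an iterator of batches while the reader actually emits rows. `Row`, `Batch`, and `ErasedIteratorCastDemo` are hypothetical stand-ins, not Spark or Hudi classes. Because generics are erased, the unchecked cast of the iterator itself succeeds, and the exception only surfaces when an element is pulled with `next()`:

```java
import java.util.Arrays;
import java.util.Iterator;

public class ErasedIteratorCastDemo {
    // Hypothetical stand-in for a row-based record (e.g. an InternalRow).
    public static class Row {}

    // Hypothetical stand-in for a columnar batch (e.g. a ColumnarBatch).
    public static class Batch {
        public int numRows() { return 1; }
    }

    // The element type is erased at runtime, so this cast never fails here.
    @SuppressWarnings("unchecked")
    static Iterator<Batch> asBatches(Iterator<?> readerOutput) {
        return (Iterator<Batch>) readerOutput;
    }

    // Consumes the iterator as batches; the compiler-inserted cast on next()
    // is where a row-producing reader is finally detected.
    public static String consume(Iterator<?> readerOutput) {
        Iterator<Batch> batches = asBatches(readerOutput);
        int rows = 0;
        try {
            while (batches.hasNext()) {
                rows += batches.next().numRows();
            }
            return "read " + rows + " rows";
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        // Reader that really produces batches: consumed normally.
        System.out.println(consume(Arrays.asList(new Batch()).iterator()));
        // Reader that produces rows while the consumer expects batches.
        System.out.println(consume(Arrays.asList(new Row()).iterator()));
    }
}
```

If this is indeed the cause, the reader and the scan need to agree on the element type: either the reader returns batches when the scan expects them, or the scan is told not to expect batches on this path. Which of the two applies here is still to be determined.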
> Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR
> Snapshot Query
> --------------------------------------------------------------------------------------
>
> Key: HUDI-6786
> URL: https://issues.apache.org/jira/browse/HUDI-6786
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ethan Guo
> Assignee: Lin Liu
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Goal: When `NewHoodieParquetFileFormat` is enabled with
> `hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, the MOR
> Snapshot query should use HoodieFileGroupReader. All relevant tests on basic
> MOR snapshot queries should pass (except for the caveats in the current
> HoodieFileGroupReader; see the other open HoodieFileGroupReader tickets in
> this EPIC).
> The query logic is implemented in
> `NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; see the
> following code for MOR snapshot query:
> {code:java}
> else {
>   if (logFiles.nonEmpty) {
>     val baseFile = createPartitionedFile(InternalRow.empty,
>       hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
>     buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles,
>       filePath.getParent, requiredSchemaWithMandatory,
>       requiredSchemaWithMandatory, outputSchema, partitionSchema,
>       partitionValues, broadcastedHadoopConf.value.value)
>   } else {
>     throw new IllegalStateException("should not be here since file slice " +
>       "should not have been broadcasted since it has no log or data files")
>     //baseFileReader(baseFile)
>   }
> {code}
> `buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`,
> behind a new config `hoodie.read.use.new.file.group.reader`, with the
> correct base file and log file list passed in.
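A rough sketch of the config gating described above, written as plain Java with no Hudi or Spark dependencies. The two config keys are taken from the issue text; the class `ReaderSelectionSketch`, the `ReadPath` enum, and the default value of `false` for both keys are assumptions made for illustration and may not match the actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class ReaderSelectionSketch {
    // Config keys as quoted in the issue description.
    static final String USE_NEW_PARQUET_FILE_FORMAT =
        "hoodie.datasource.read.use.new.parquet.file.format";
    static final String USE_NEW_FILE_GROUP_READER =
        "hoodie.read.use.new.file.group.reader";

    // Hypothetical labels for the three read paths discussed in the issue.
    public enum ReadPath { LEGACY_FILE_FORMAT, MERGE_ON_READ_ITERATOR, FILE_GROUP_READER }

    // Chooses a read path for a MOR snapshot query on a file slice with log files.
    // Assumption: both flags default to false when unset.
    public static ReadPath choose(Map<String, String> options) {
        boolean newFormat = Boolean.parseBoolean(
            options.getOrDefault(USE_NEW_PARQUET_FILE_FORMAT, "false"));
        boolean newReader = Boolean.parseBoolean(
            options.getOrDefault(USE_NEW_FILE_GROUP_READER, "false"));
        if (!newFormat) {
            return ReadPath.LEGACY_FILE_FORMAT;
        }
        return newReader ? ReadPath.FILE_GROUP_READER : ReadPath.MERGE_ON_READ_ITERATOR;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put(USE_NEW_PARQUET_FILE_FORMAT, "true");
        options.put(USE_NEW_FILE_GROUP_READER, "true");
        System.out.println(choose(options));
    }
}
```

Keeping the new reader behind its own flag leaves the existing `buildMergeOnReadIterator` path available as a fallback while the integration is being stabilized.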