[jira] [Comment Edited] (HUDI-6786) Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR Snapshot Query

Lin Liu (Jira) Wed, 04 Oct 2023 16:26:05 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771993#comment-17771993
 ]


Lin Liu edited comment on HUDI-6786 at 10/4/23 11:25 PM:
---------------------------------------------------------

Updates:
 # Resolved the serialization problem by relying on the broadcast of the 
SparkSession.
 # Resolved one of the NullPointerExceptions, which was caused by the 
initialization of the iterator.
 # Now facing a NullPointerException caused by sparkSession.sessionState; see 
the stack trace below.

 
{code:java}
Caused by: java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.parquet.Spark33LegacyHoodieParquetFileFormat.buildReaderWithPartitionValues(Spark33LegacyHoodieParquetFileFormat.scala:114)
    at org.apache.spark.sql.execution.datasources.parquet.LegacyHoodieParquetFileFormat.buildReaderWithPartitionValues(LegacyHoodieParquetFileFormat.scala:62)
    at org.apache.hudi.SparkFileFormatInternalRowReaderContext.getFileRecordIterator(SparkFileFormatInternalRowReaderContext.scala:60)
    at org.apache.hudi.common.table.read.HoodieFileGroupReader.initRecordIterators(HoodieFileGroupReader.java:139)
    at org.apache.spark.sql.execution.datasources.parquet.NewHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$3(NewHoodieParquetFileFormat.scala:179)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
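
The trace shows the NullPointerException being raised inside an executor task 
(ShuffleMapTask -> FileScanRDD -> buildReaderWithPartitionValues). One 
plausible reading, not confirmed in this thread, is that session-level state 
reachable on the driver is no longer populated once the enclosing object has 
been serialized and shipped to an executor. The sketch below illustrates that 
general mechanism in plain Java, with no Spark dependency. All class names 
(DriverOnlyState, SessionLike, ReaderConfig, TransientStateDemo) are 
hypothetical stand-ins and are not Spark or Hudi classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for state that only exists on the driver (not a Spark class).
class DriverOnlyState {
    final String timeZone;
    DriverOnlyState(String timeZone) { this.timeZone = timeZone; }
}

// Hypothetical stand-in for a session-like object that gets shipped to executors.
class SessionLike implements Serializable {
    private static final long serialVersionUID = 1L;
    // transient fields are skipped by Java serialization, so this is null after a round trip.
    transient DriverOnlyState state;
    SessionLike(DriverOnlyState state) { this.state = state; }
}

// Reads the needed value eagerly, while the state is still available, into a plain field.
class ReaderConfig implements Serializable {
    private static final long serialVersionUID = 1L;
    final String timeZone;
    ReaderConfig(SessionLike session) { this.timeZone = session.state.timeZone; }
}

class TransientStateDemo {
    // Serializes and deserializes a value, mimicking the trip from driver to executor.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SessionLike session = new SessionLike(new DriverOnlyState("UTC"));
        ReaderConfig config = new ReaderConfig(session); // captured on the "driver"

        SessionLike shippedSession = roundTrip(session);
        ReaderConfig shippedConfig = roundTrip(config);

        // The transient state did not survive; dereferencing it would throw an NPE.
        System.out.println("state is null after round trip: " + (shippedSession.state == null));
        // The eagerly captured value did survive.
        System.out.println("captured time zone: " + shippedConfig.timeZone);
    }
}
```

If this is indeed the cause, a possible direction is to read whatever the 
reader needs from the session on the driver and pass those plain values into 
the closure, rather than dereferencing the session inside the task.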


was (Author: JIRAUSER301185):
Updates:
 # Have resolved the serialization problem, whose solution is to rely on the 
broadcast of sparksession.
 # Resolved one of the NullPointerException due to the initialization of the 
iterator.
 # Now I am facing a NullPointerException caused by the 
sparksession.sessionstate. 

> Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR 
> Snapshot Query
> --------------------------------------------------------------------------------------
>
>                 Key: HUDI-6786
>                 URL: https://issues.apache.org/jira/browse/HUDI-6786
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo
>            Assignee: Lin Liu
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>
> Goal: When `NewHoodieParquetFileFormat` is enabled with 
> `hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, the MOR 
> snapshot query should use `HoodieFileGroupReader`. All relevant tests for 
> basic MOR snapshot queries should pass (except for the known caveats of the 
> current `HoodieFileGroupReader`; see the other open tickets on 
> `HoodieFileGroupReader` in this EPIC).
> The query logic is implemented in 
> `NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; see the 
> following code for the MOR snapshot query:
> {code:java}
> else {
>   if (logFiles.nonEmpty) {
>     val baseFile = createPartitionedFile(InternalRow.empty, hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
>     buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles, filePath.getParent, requiredSchemaWithMandatory,
>       requiredSchemaWithMandatory, outputSchema, partitionSchema, partitionValues, broadcastedHadoopConf.value.value)
>   } else {
>     throw new IllegalStateException("should not be here since file slice should not have been broadcasted since it has no log or data files")
>     //baseFileReader(baseFile)
>   }
> {code}
> `buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`, 
> gated by a new config `hoodie.read.use.new.file.group.reader`, and should be 
> passed the correct base file and list of log files.
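
As a rough illustration of the gating described in the issue, the following 
sketch picks a read path from the two config keys named above. Only the two 
config key strings come from the issue text. The class ReaderSelector, the 
enum MorReadPath, and the choice of "false" as the default for both keys are 
assumptions of this sketch and are not taken from the Hudi code base.

```java
import java.util.Properties;

class ReaderSelector {
    // Config keys quoted in the issue description.
    static final String NEW_FILE_FORMAT_KEY = "hoodie.datasource.read.use.new.parquet.file.format";
    static final String NEW_FILE_GROUP_READER_KEY = "hoodie.read.use.new.file.group.reader";

    // Hypothetical labels for the code paths; not Hudi types.
    enum MorReadPath { LEGACY_FILE_FORMAT, MERGE_ON_READ_ITERATOR, FILE_GROUP_READER }

    static MorReadPath select(Properties props) {
        // Defaulting both flags to "false" is an assumption of this sketch.
        boolean newFormat = Boolean.parseBoolean(props.getProperty(NEW_FILE_FORMAT_KEY, "false"));
        boolean newReader = Boolean.parseBoolean(props.getProperty(NEW_FILE_GROUP_READER_KEY, "false"));
        if (!newFormat) {
            return MorReadPath.LEGACY_FILE_FORMAT;     // new file format not enabled at all
        }
        // Within the new file format, the second flag switches the merge implementation.
        return newReader ? MorReadPath.FILE_GROUP_READER : MorReadPath.MERGE_ON_READ_ITERATOR;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(NEW_FILE_FORMAT_KEY, "true");
        props.setProperty(NEW_FILE_GROUP_READER_KEY, "true");
        System.out.println(select(props));
    }
}
```

Keeping the existing iterator path behind the flag would let the current 
tests keep running against both implementations while the caveats tracked in 
the other tickets are worked through.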



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
