[jira] [Updated] (HUDI-6786) Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR Snapshot Query

Ethan Guo (Jira) Mon, 02 Oct 2023 08:38:42 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ethan Guo updated HUDI-6786:
----------------------------
    Description: 
Goal: When `NewHoodieParquetFileFormat` is enabled with 
`hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, the MOR 
snapshot query should use `HoodieFileGroupReader`. All relevant tests for the 
basic MOR snapshot query should pass, except where they hit known caveats of 
the current `HoodieFileGroupReader` (see the other open tickets on 
`HoodieFileGroupReader` in this epic).

The query logic is implemented in 
`NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; see the following 
code for MOR snapshot query:
{code:java}
else {
  if (logFiles.nonEmpty) {
    // Merge the base file with its log files for the MOR snapshot read.
    val baseFile = createPartitionedFile(
      InternalRow.empty, hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
    buildMergeOnReadIterator(
      preMergeBaseFileReader(baseFile), logFiles, filePath.getParent,
      requiredSchemaWithMandatory, requiredSchemaWithMandatory, outputSchema,
      partitionSchema, partitionValues, broadcastedHadoopConf.value.value)
  } else {
    throw new IllegalStateException(
      "should not be here since file slice should not have been broadcasted since it has no log or data files")
    //baseFileReader(baseFile)
  }
}
{code}
`buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`, with 
a new config `hoodie.read.use.new.file.group.reader`.
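
One way the config-gated switch could look, as a minimal Java sketch and not 
the actual Hudi change. Only the config key comes from this ticket. The class 
name, the `readFileSlice` helper, the two suppliers standing in for the 
`buildMergeOnReadIterator` path and the `HoodieFileGroupReader` path, and the 
default of `false` are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.Supplier;

public class ReaderDispatchSketch {
    // Config key named in this ticket.
    static final String USE_NEW_FILE_GROUP_READER =
        "hoodie.read.use.new.file.group.reader";

    /**
     * Hypothetical helper: picks the merge path for a file slice with log files.
     * legacyMergePath stands in for the buildMergeOnReadIterator call;
     * fileGroupReaderPath stands in for a HoodieFileGroupReader-backed iterator.
     * Only the selected supplier is invoked.
     */
    static <T> Iterator<T> readFileSlice(Map<String, String> options,
                                         Supplier<Iterator<T>> legacyMergePath,
                                         Supplier<Iterator<T>> fileGroupReaderPath) {
        // Assumption: the flag defaults to false when it is not set.
        boolean useNewReader = Boolean.parseBoolean(
            options.getOrDefault(USE_NEW_FILE_GROUP_READER, "false"));
        return useNewReader ? fileGroupReaderPath.get() : legacyMergePath.get();
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put(USE_NEW_FILE_GROUP_READER, "true");

        // Placeholder iterators; real code would yield merged rows.
        Iterator<String> rows = ReaderDispatchSketch.<String>readFileSlice(
            options,
            () -> Arrays.asList("legacy-merge-iterator").iterator(),
            () -> Arrays.asList("file-group-reader").iterator());
        System.out.println(rows.next());
    }
}
```

Passing both paths as suppliers keeps the unselected reader from being 
constructed, so the existing path stays untouched while the flag is off.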

  was:Goal: When `NewHoodieParquetFileFormat` is enabled with 
`hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, the MOR 
Snapshot query should use 


> Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR 
> Snapshot Query
> --------------------------------------------------------------------------------------
>
>                 Key: HUDI-6786
>                 URL: https://issues.apache.org/jira/browse/HUDI-6786
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo
>            Assignee: Lin Liu
>            Priority: Blocker
>             Fix For: 1.0.0
>


