[ https://issues.apache.org/jira/browse/HUDI-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo updated HUDI-6786:
----------------------------
Description:
Goal: When `NewHoodieParquetFileFormat` is enabled with
`hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, the MOR
snapshot query should use `HoodieFileGroupReader`. All relevant tests for the
basic MOR snapshot query should pass, except for known caveats in the current
`HoodieFileGroupReader` (see the other open `HoodieFileGroupReader` tickets in
this EPIC).
The query logic is implemented in
`NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; the current code
path for the MOR snapshot query is:
{code:java}
else {
  if (logFiles.nonEmpty) {
    val baseFile = createPartitionedFile(InternalRow.empty,
      hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
    buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles,
      filePath.getParent, requiredSchemaWithMandatory,
      requiredSchemaWithMandatory, outputSchema, partitionSchema,
      partitionValues, broadcastedHadoopConf.value.value)
  } else {
    throw new IllegalStateException("should not be here since file slice should " +
      "not have been broadcasted since it has no log or data files")
    //baseFileReader(baseFile)
  }
}
{code}
`buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`, with
a new config `hoodie.read.use.new.file.group.reader`.
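
One way to picture the switch described above is a small config-gated selection step. The following is only a sketch under stated assumptions, not the Hudi implementation: the class `ReaderSelection`, the enum `ReadPath`, and the method `choose` are hypothetical names made up for illustration, and the default of `false` for the flag is an assumption. Only the config key itself comes from this ticket.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these types exist in Hudi.
final class ReaderSelection {

    // Config key named in this ticket.
    static final String USE_NEW_FILE_GROUP_READER =
        "hoodie.read.use.new.file.group.reader";

    // The two merge paths discussed in the ticket.
    enum ReadPath { FILE_GROUP_READER, LEGACY_MERGE_ITERATOR }

    // Picks the merge path for a MOR snapshot read from the read options.
    // Assumption: the flag defaults to false when it is not set.
    static ReadPath choose(Map<String, String> options) {
        boolean useNew = Boolean.parseBoolean(
            options.getOrDefault(USE_NEW_FILE_GROUP_READER, "false"));
        return useNew ? ReadPath.FILE_GROUP_READER : ReadPath.LEGACY_MERGE_ITERATOR;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        System.out.println(choose(options)); // flag unset: legacy path

        options.put(USE_NEW_FILE_GROUP_READER, "true");
        System.out.println(choose(options)); // flag set: new reader path
    }
}
```

Keeping the legacy path reachable behind the flag would let the existing MOR snapshot tests run against both readers while the open `HoodieFileGroupReader` caveats are resolved.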
was: Goal: When `NewHoodieParquetFileFormat` is enabled with
`hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, the MOR
Snapshot query should use
> Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR
> Snapshot Query
> --------------------------------------------------------------------------------------
>
> Key: HUDI-6786
> URL: https://issues.apache.org/jira/browse/HUDI-6786
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ethan Guo
> Assignee: Lin Liu
> Priority: Blocker
> Fix For: 1.0.0
>