[
https://issues.apache.org/jira/browse/HUDI-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yue Zhang updated HUDI-5245:
----------------------------
Fix Version/s: 0.14.0
(was: 0.13.1)
> Honor pruned partitions while looking up in col stats partition in MDT
> ----------------------------------------------------------------------
>
> Key: HUDI-5245
> URL: https://issues.apache.org/jira/browse/HUDI-5245
> Project: Apache Hudi
> Issue Type: Improvement
> Components: metadata
> Reporter: sivabalan narayanan
> Priority: Critical
> Fix For: 0.14.0
>
>
> When looking up col stats for data skipping, we pass in only the list of
> columns in the predicate. We don't leverage the pruned list of partitions
> in this call.
>
> For example, if there are 1000 partitions and 5 columns in the predicate,
> and only 10 partitions match after pruning, the existing call fetches
> 5 cols * 1000 partitions = 5k entries from the col_stats partition in MDT
> to do file skipping, whereas if we wire in the pruned list of partitions,
> we only need to do file skipping from 5 cols * 10 partitions = 50 entries.
>
> {code:java}
> private def loadColumnStatsIndexRecords(targetColumns: Seq[String], shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
>   // Read Metadata Table's Column Stats Index records into [[HoodieData]] container by
>   //    - Fetching the records from CSI by key-prefixes (encoded column names)
>   //    - Extracting [[HoodieMetadataColumnStats]] records
>   //    - Filtering out nulls
>   checkState(targetColumns.nonEmpty)
>   // TODO encoding should be done internally w/in HoodieBackedTableMetadata
>   val encodedTargetColumnNames = targetColumns.map(colName => new ColumnIndexID(colName).asBase64EncodedString())
>   val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
>     metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava, HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS, shouldReadInMemory)
> .
> . {code}
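
A minimal, self-contained sketch of the idea described above, not Hudi code. It compares building one key prefix per predicate column (what the snippet above does) with building one prefix per (column, pruned partition) pair. The class name, the helper names, and the Base64 `encode` stand-in are all hypothetical, and the sketch assumes that a col_stats key begins with the column ID followed by the partition ID, so that a combined prefix narrows the lookup to a single partition.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

class PrunedColStatsPrefixes {

    // Hypothetical stand-in for an ID encoder: plain Base64, not Hudi's real hashing.
    static String encode(String name) {
        return Base64.getEncoder().encodeToString(name.getBytes(StandardCharsets.UTF_8));
    }

    // Behaviour described in the issue: one prefix per predicate column,
    // so each prefix matches that column's stats in every partition.
    static List<String> columnPrefixes(List<String> columns) {
        List<String> prefixes = new ArrayList<>();
        for (String column : columns) {
            prefixes.add(encode(column));
        }
        return prefixes;
    }

    // Proposed direction: one prefix per (column, pruned partition) pair.
    // ASSUMPTION: the key starts with the column ID followed by the partition ID.
    static List<String> columnPartitionPrefixes(List<String> columns, List<String> prunedPartitions) {
        List<String> prefixes = new ArrayList<>();
        for (String column : columns) {
            for (String partition : prunedPartitions) {
                prefixes.add(encode(column) + encode(partition));
            }
        }
        return prefixes;
    }

    public static void main(String[] args) {
        List<String> columns = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            columns.add("col_" + i);
        }
        List<String> prunedPartitions = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            prunedPartitions.add("partition_" + i);
        }
        int totalPartitions = 1000;

        // Column-only prefixes: each of the 5 prefixes spans all 1000 partitions.
        System.out.println(columnPrefixes(columns).size() * totalPartitions);
        // Column + partition prefixes: 5 columns x 10 pruned partitions.
        System.out.println(columnPartitionPrefixes(columns, prunedPartitions).size());
    }
}
```

With the numbers from the description, the two printed values are 5000 and 50. The trade-off is that the number of prefixes grows with the number of pruned partitions, so this only helps when pruning is selective.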
--
This message was sent by Atlassian Jira
(v8.20.10#820010)