[
https://issues.apache.org/jira/browse/HUDI-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636882#comment-17636882
]
Alexey Kudinkin commented on HUDI-5245:
---------------------------------------
I'd actually clarify the requirement a little bit to avoid using both at the
same time:
# If CSI is enabled, we should lookup just the CSI
# If it's not enabled, we should do the partition-pruning
> Honor pruned partitions while looking up in col stats partition in MDT
> ----------------------------------------------------------------------
>
> Key: HUDI-5245
> URL: https://issues.apache.org/jira/browse/HUDI-5245
> Project: Apache Hudi
> Issue Type: Improvement
> Components: metadata
> Reporter: sivabalan narayanan
> Priority: Critical
> Fix For: 0.13.0
>
>
> When looking up in col stats for data skipping, we are passing in only the
> list of columns in the predicate. We don't leverage the pruned list of
> partitions in this call.
>
> For eg, if there are 1000 partitions and 5 cols w/ predicate, and only 10
> partitions are matched after pruning,
> exiting call will fetch 5 cols * 1000 partitions = 5k entries from col_stats
> partition in MDT to do file skipping.
> where as if we wire in pruned list of partitions, then we only need to do
> file skipping from 50 entries.
>
> {code:java}
> private def loadColumnStatsIndexRecords(targetColumns: Seq[String],
> shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
> // Read Metadata Table's Column Stats Index records into [[HoodieData]]
> container by
> // - Fetching the records from CSI by key-prefixes (encoded column names)
> // - Extracting [[HoodieMetadataColumnStats]] records
> // - Filtering out nulls
> checkState(targetColumns.nonEmpty)
> // TODO encoding should be done internally w/in HoodieBackedTableMetadata
> val encodedTargetColumnNames = targetColumns.map(colName => new
> ColumnIndexID(colName).asBase64EncodedString())
> val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
> metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava,
> HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS, shouldReadInMemory)
> .
> . {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)