[
https://issues.apache.org/jira/browse/HUDI-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan updated HUDI-5245:
--------------------------------------
Description:
When looking up in col stats for data skipping, we are passing in only the list
of columns in the predicate. We don't leverage the pruned list of partitions in
this call.
For eg, if there are 1000 partitions and 5 cols w/ predicate, and only 10
partitions are matched after pruning,
exiting call will fetch 5 cols * 1000 partitions = 5k entries from col_stats
partition in MDT to do file skipping.
where as if we wire in pruned list of partitions, then we only need to do file
skipping from 50 entries.
{code:java}
private def loadColumnStatsIndexRecords(targetColumns: Seq[String],
shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
// Read Metadata Table's Column Stats Index records into [[HoodieData]]
container by
// - Fetching the records from CSI by key-prefixes (encoded column names)
// - Extracting [[HoodieMetadataColumnStats]] records
// - Filtering out nulls
checkState(targetColumns.nonEmpty)
// TODO encoding should be done internally w/in HoodieBackedTableMetadata
val encodedTargetColumnNames = targetColumns.map(colName => new
ColumnIndexID(colName).asBase64EncodedString())
val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava,
HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS, shouldReadInMemory)
.
. {code}
was:
When looking up in col stats for data skipping, we are passing in only the list
of columns in the predicate. We don't leverage the pruned list of partitions in
this call.
For eg, if there are 1000 partitions and 100 cols, and only 10 partitions are
matched after pruning,
exiting call will fetch 100 cols * 1000 partitions = 10k entries from col_stats
partition in MDT to do file skipping.
where as if we wire in pruned list of partitions, then we only need to do file
skipping from 1000 entries.
{code:java}
private def loadColumnStatsIndexRecords(targetColumns: Seq[String],
shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
// Read Metadata Table's Column Stats Index records into [[HoodieData]]
container by
// - Fetching the records from CSI by key-prefixes (encoded column names)
// - Extracting [[HoodieMetadataColumnStats]] records
// - Filtering out nulls
checkState(targetColumns.nonEmpty)
// TODO encoding should be done internally w/in HoodieBackedTableMetadata
val encodedTargetColumnNames = targetColumns.map(colName => new
ColumnIndexID(colName).asBase64EncodedString())
val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava,
HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS, shouldReadInMemory)
.
. {code}
> Honor pruned partitions while looking up in col stats partition in MDT
> ----------------------------------------------------------------------
>
> Key: HUDI-5245
> URL: https://issues.apache.org/jira/browse/HUDI-5245
> Project: Apache Hudi
> Issue Type: Improvement
> Components: metadata
> Reporter: sivabalan narayanan
> Priority: Major
>
> When looking up in col stats for data skipping, we are passing in only the
> list of columns in the predicate. We don't leverage the pruned list of
> partitions in this call.
>
> For eg, if there are 1000 partitions and 5 cols w/ predicate, and only 10
> partitions are matched after pruning,
> exiting call will fetch 5 cols * 1000 partitions = 5k entries from col_stats
> partition in MDT to do file skipping.
> where as if we wire in pruned list of partitions, then we only need to do
> file skipping from 50 entries.
>
> {code:java}
> private def loadColumnStatsIndexRecords(targetColumns: Seq[String],
> shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
> // Read Metadata Table's Column Stats Index records into [[HoodieData]]
> container by
> // - Fetching the records from CSI by key-prefixes (encoded column names)
> // - Extracting [[HoodieMetadataColumnStats]] records
> // - Filtering out nulls
> checkState(targetColumns.nonEmpty)
> // TODO encoding should be done internally w/in HoodieBackedTableMetadata
> val encodedTargetColumnNames = targetColumns.map(colName => new
> ColumnIndexID(colName).asBase64EncodedString())
> val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
> metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava,
> HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS, shouldReadInMemory)
> .
> . {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)