[jira] [Updated] (HUDI-5245) Honor pruned partitions while looking up in col stats partition in MDT

sivabalan narayanan (Jira) Sat, 19 Nov 2022 23:37:07 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


sivabalan narayanan updated HUDI-5245:
--------------------------------------
    Description: 
When looking up in col stats for data skipping, we are passing in only the list 
of columns in the predicate. We don't leverage the pruned list of partitions in 
this call.

 

For eg, if there are 1000 partitions and 5 cols w/ predicate, and only 10 
partitions are matched after pruning,

exiting call will fetch 5 cols * 1000 partitions = 5k entries from col_stats 
partition in MDT to do file skipping.
where as if we wire in pruned list of partitions, then we only need to do file 
skipping from 50 entries. 

 
{code:java}
private def loadColumnStatsIndexRecords(targetColumns: Seq[String], 
shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
  // Read Metadata Table's Column Stats Index records into [[HoodieData]] 
container by
  //    - Fetching the records from CSI by key-prefixes (encoded column names)
  //    - Extracting [[HoodieMetadataColumnStats]] records
  //    - Filtering out nulls
  checkState(targetColumns.nonEmpty)

  // TODO encoding should be done internally w/in HoodieBackedTableMetadata
  val encodedTargetColumnNames = targetColumns.map(colName => new 
ColumnIndexID(colName).asBase64EncodedString())

  val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
    metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava, 
HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS, shouldReadInMemory)
.
. {code}

  was:
When looking up in col stats for data skipping, we are passing in only the list 
of columns in the predicate. We don't leverage the pruned list of partitions in 
this call.

 

For eg, if there are 1000 partitions and 100 cols, and only 10 partitions are 
matched after pruning,

exiting call will fetch 100 cols * 1000 partitions = 10k entries from col_stats 
partition in MDT to do file skipping.
where as if we wire in pruned list of partitions, then we only need to do file 
skipping from 1000 entries. 

 
{code:java}
private def loadColumnStatsIndexRecords(targetColumns: Seq[String], 
shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
  // Read Metadata Table's Column Stats Index records into [[HoodieData]] 
container by
  //    - Fetching the records from CSI by key-prefixes (encoded column names)
  //    - Extracting [[HoodieMetadataColumnStats]] records
  //    - Filtering out nulls
  checkState(targetColumns.nonEmpty)

  // TODO encoding should be done internally w/in HoodieBackedTableMetadata
  val encodedTargetColumnNames = targetColumns.map(colName => new 
ColumnIndexID(colName).asBase64EncodedString())

  val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
    metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava, 
HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS, shouldReadInMemory)
.
. {code}


> Honor pruned partitions while looking up in col stats partition in MDT
> ----------------------------------------------------------------------
>
>                 Key: HUDI-5245
>                 URL: https://issues.apache.org/jira/browse/HUDI-5245
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: metadata
>            Reporter: sivabalan narayanan
>            Priority: Major
>
> When looking up in col stats for data skipping, we are passing in only the 
> list of columns in the predicate. We don't leverage the pruned list of 
> partitions in this call.
>  
> For eg, if there are 1000 partitions and 5 cols w/ predicate, and only 10 
> partitions are matched after pruning,
> exiting call will fetch 5 cols * 1000 partitions = 5k entries from col_stats 
> partition in MDT to do file skipping.
> where as if we wire in pruned list of partitions, then we only need to do 
> file skipping from 50 entries. 
>  
> {code:java}
> private def loadColumnStatsIndexRecords(targetColumns: Seq[String], 
> shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
>   // Read Metadata Table's Column Stats Index records into [[HoodieData]] 
> container by
>   //    - Fetching the records from CSI by key-prefixes (encoded column names)
>   //    - Extracting [[HoodieMetadataColumnStats]] records
>   //    - Filtering out nulls
>   checkState(targetColumns.nonEmpty)
>   // TODO encoding should be done internally w/in HoodieBackedTableMetadata
>   val encodedTargetColumnNames = targetColumns.map(colName => new 
> ColumnIndexID(colName).asBase64EncodedString())
>   val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
>     metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava, 
> HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS, shouldReadInMemory)
> .
> . {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-5245) Honor pruned partitions while looking up in col stats partition in MDT

Reply via email to