[
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinaykumar Bhat updated HUDI-7579:
----------------------------------
Component/s: index
Epic Link: HUDI-512
Fix Version/s: 1.0.0
> Functional index (on col stats) creation fails to process all files/partitions
> ------------------------------------------------------------------------------
>
> Key: HUDI-7579
> URL: https://issues.apache.org/jira/browse/HUDI-7579
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Reporter: Vinaykumar Bhat
> Priority: Major
> Fix For: 1.0.0
>
>
> Creating a functional index on an existing table fails to process all files
> and partitions of the table. The col-stats MDT partition ends up with entries
> for only a subset of the table's files. An example follows.
>
> The following create-table and insert statements should create a table with 3
> partitions (each partition having one slice):
> {code:java}
> spark.sql(
> s"""
> |create table test_table(
> | id int,
> | name string,
> | ts long,
> | price int
> |) using hudi
> | options (
> | primaryKey ='id',
> | type = 'cow',
> | preCombineField = 'ts',
> | hoodie.metadata.record.index.enable = 'true',
> | hoodie.datasource.write.recordkey.field = 'id'
> | )
> | partitioned by(price)
> | location '$basePath'
> """.stripMargin)
> spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 200000, 100)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 2000000000, 1000)")
> {code}
> Now create a functional index (using col stats) on this table. The col-stats
> partition in the MDT should have three entries (column-level stats for the 3
> files). However, it has only a single entry (for one of the files).
>
> {code:java}
> // Keep each SQL string on one line: an s"..." literal cannot span lines.
> val createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"
> spark.sql(createIndexSql)
> spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
> {code}
> As seen below, col-stats has only one entry, for one of the files (statistics
> for the other two files are missing):
> *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet,
> ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null,
> null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null,
> null}, 1, 0, 434874, 869748, false}*
>
> {noformat}
>
> +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> |key                                             |type|ColumnStatsMetadata |
> +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
> +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> {noformat}
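>
> For reference, the values the index function should yield for the three
> inserted ts values can be sketched as below. This is only an illustration,
> not Hudi code: the object and method names are hypothetical, ts is assumed
> to be in seconds since the epoch, and a UTC session time zone is assumed
> (Spark's from_unixtime formats in the session time zone).
> {code:java}
> import java.time.{Instant, ZoneOffset}
> import java.time.format.DateTimeFormatter
>
> // Hypothetical helper, not part of Hudi or Spark.
> object ExpectedIndexValues {
>   // Assumption: UTC session time zone.
>   private val fmt =
>     DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)
>
>   // Approximates from_unixtime(ts, 'yyyy-MM-dd') for ts in epoch seconds.
>   def fromUnixtime(tsSeconds: Long): String =
>     fmt.format(Instant.ofEpochSecond(tsSeconds))
>
>   def main(args: Array[String]): Unit = {
>     // (price partition, ts) pairs taken from the three inserts above.
>     val rows = Seq(10 -> 1000L, 100 -> 200000L, 1000 -> 2000000000L)
>     rows.foreach { case (price, ts) =>
>       println(s"price=$price ts=$ts -> ${fromUnixtime(ts)}")
>     }
>   }
> }
> {code}
> Under these assumptions the three files map to 1970-01-01, 1970-01-03 and
> 2033-05-18. The single entry shown above (min and max both 1970-01-01) is
> consistent with only the file holding ts = 1000 having been processed.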
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)