Vinaykumar Bhat created HUDI-7579:
-------------------------------------
Summary: Functional index (on col stats) creation fails to process
all files/partitions
Key: HUDI-7579
URL: https://issues.apache.org/jira/browse/HUDI-7579
Project: Apache Hudi
Issue Type: Bug
Reporter: Vinaykumar Bhat
The following create-table and insert statements should produce a table with
3 partitions, each containing one file slice:
```
spark.sql(s"""create table test_table (id int, name string, ts long, price int)
using hudi
| options (
| primaryKey ='id',
| type = '$tableType',
| preCombineField = 'ts',
| hoodie.metadata.record.index.enable = 'true',
| hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
// One insert per partition value of price (10, 100, 1000).
spark.sql("insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql("insert into test_table (id, name, ts, price) values(2, 'a2', 200000, 100)")
spark.sql("insert into test_table (id, name, ts, price) values(3, 'a3', 2000000000, 1000)")
```
Now create a functional index (using col stats) on this table. The col-stats
partition of the metadata table (MDT) should then have three entries, one with
column-level stats for each of the 3 files. However, col stats contains only a
single entry, for one of the files.
```
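// Functional index: column stats over from_unixtime(ts, 'yyyy-MM-dd').
val createIndexSql = "create index idx_datestr on test_table using " +
  "column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"
spark.sql(createIndexSql)
// type = 3 selects the column-stats records in the metadata table.
spark.sql("select key, type, ColumnStatsMetadata from " +
  "hudi_metadata('test_table') where type = 3").show(false)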
```
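The expectation can also be written as a check. This is a minimal sketch that assumes the same `spark` session and the `test_table` created above; it reuses only the `hudi_metadata` query from this report, and the variable name `colStatsEntries` is illustrative.

```scala
// Count the column-stats records (type = 3) in the metadata table.
val colStatsEntries = spark.sql(
  "select key from hudi_metadata('test_table') where type = 3").count()

// One entry per file is expected (3 files); with this bug only 1 is found.
assert(colStatsEntries == 3,
  s"expected 3 col-stats entries, found $colStatsEntries")
```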
As seen below, col stats has only one entry, for one of the files, and is
missing statistics for the other two files:
`{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet,
ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null},
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1,
0, 434874, 869748, false}`
{{+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+}}
{{|key |type|ColumnStatsMetadata
|}}
{{+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+}}
{{|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3
|\{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet,
ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null},
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1,
0, 434874, 869748, false}|}}
{{+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+}}
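For reference, the dates that the three inserted `ts` values would map to can be worked out without Spark. The sketch below is an approximation, not Hudi's implementation: it assumes `ts` is in epoch seconds, that the session time zone is UTC, and that `from_unixtime` with `format='yyyy-MM-dd'` matches `java.time` formatting for these inputs. The object and method names are illustrative.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object ExpectedIndexValues {
  // Same pattern as the index option format='yyyy-MM-dd'; UTC is an assumption.
  private val fmt =
    DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)

  // Formats epoch seconds as a date string.
  def formatEpochSeconds(ts: Long): String =
    fmt.format(Instant.ofEpochSecond(ts))

  def main(args: Array[String]): Unit = {
    // ts values taken from the three insert statements in this report.
    Seq(1000L, 200000L, 2000000000L).foreach { ts =>
      println(s"ts=$ts -> ${formatEpochSeconds(ts)}")
    }
    // ts=1000 -> 1970-01-01
    // ts=200000 -> 1970-01-03
    // ts=2000000000 -> 2033-05-18
  }
}
```

Under these assumptions, the single entry observed above (min and max both `1970-01-01`) is consistent with the row where `ts = 1000`, which suggests, but does not prove, that only the file from the first insert was processed.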