Vinaykumar Bhat created HUDI-7579:
-------------------------------------

             Summary: Functional index (on col stats) creation fails to process all files/partitions
                 Key: HUDI-7579
                 URL: https://issues.apache.org/jira/browse/HUDI-7579
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Vinaykumar Bhat


The following create-table and insert statements should create a table with 3 
partitions (each partition having one file slice).

```
// tableType and basePath are supplied by the surrounding test;
// tableName is expected to resolve to "test_table".
spark.sql(s"""create table test_table (id int, name string, ts long, price int)
| using hudi
| options (
| primaryKey ='id',
| type = '$tableType',
| preCombineField = 'ts',
| hoodie.metadata.record.index.enable = 'true',
| hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)

// One insert per distinct price value, so each row lands in its own partition.
spark.sql(s"insert into $tableName (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into $tableName (id, name, ts, price) values(2, 'a2', 200000, 100)")
spark.sql(s"insert into $tableName (id, name, ts, price) values(3, 'a3', 2000000000, 1000)")
```
 
Now create a functional index (using column stats) on this table. The column 
stats in the metadata table (MDT) should have three entries (column-level stats 
for 3 files). However, it has only a single entry (for one of the files).
 
```
var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)

spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
```
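
For reference, the date strings that the index function would be expected to produce for the three inserted `ts` values can be computed outside Spark. The following is only a sketch: it assumes the Spark session time zone is UTC (Spark's `from_unixtime` formats in the session time zone), and the object and method names (`ExpectedIndexValues`, `toDateStr`) are hypothetical helpers, not part of Hudi or Spark.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of Hudi or Spark.
object ExpectedIndexValues {
  // Assumption: the Spark session time zone is UTC.
  private val fmt: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)

  // Formats epoch seconds the way from_unixtime(ts, 'yyyy-MM-dd') would under UTC.
  def toDateStr(epochSeconds: Long): String =
    fmt.format(Instant.ofEpochSecond(epochSeconds))

  def main(args: Array[String]): Unit = {
    // The three ts values from the insert statements in this report.
    val tsValues = Seq(1000L, 200000L, 2000000000L)
    tsValues.foreach(ts => println(s"$ts -> ${toDateStr(ts)}"))
  }
}
```

Under that assumption the three rows map to 1970-01-01, 1970-01-03 and 2033-05-18, so a complete index would be expected to carry one entry per file with those values as min/max. An entry with `1970-01-01` would be consistent with the row whose `ts` is 1000.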
 
As seen below, the column stats have only one entry, for one of the files, and 
are missing statistics for the other two files:

+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|key                                             |type|ColumnStatsMetadata                                                                                                                                                                                                                                                |
+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
