[ https://issues.apache.org/jira/browse/HUDI-9585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18010239#comment-18010239 ]

Voon Hou commented on HUDI-9585:
--------------------------------

Trino does implicit type coercion.
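The coercion can be illustrated outside Trino with a minimal standalone Java sketch (this mimics the widening semantics, it is not Trino code): an integer literal compared against a DOUBLE column is promoted to double before the comparison, which is why the pushed-down constraints carry 101.0 rather than the string "101".

```java
// Minimal standalone sketch (not Trino internals): an integer literal
// compared against a DOUBLE column is widened to double before the
// comparison, so the pushed-down predicate carries 101.0, not "101".
public class CoercionSketch {
    public static void main(String[] args) {
        long literal = 101L;                  // literal as written in SQL
        double coerced = (double) literal;    // implicit widening conversion
        double columnValue = 101.0;           // value stored in the double column

        System.out.println(coerced == columnValue); // prints true
        System.out.println(coerced);                // prints 101.0
    }
}
```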

 

I have manually verified this using the following testing methodology:

1. Create a table with a {{double}} column.

 
{code:java}
test("Create table multi filegroup partitioned mor") {
    withTempDir { tmp =>
        val tableName = "hudi_multi_fg_pt_mor"
        spark.sql(
            s"""
               |create table $tableName (
               |  id int,
               |  name string,
               |  price double,
               |  ts long,
               |  country string
               |) using hudi
               | location '${tmp.getCanonicalPath}'
               | tblproperties (
               |  primaryKey ='id,name',
               |  type = 'mor',
               |  preCombineField = 'ts'
               | ) partitioned by (country)
       """.stripMargin)
        // directly write to new parquet file
        spark.sql(s"set hoodie.parquet.small.file.limit=0")
        spark.sql(s"set hoodie.metadata.compact.max.delta.commits=1")
        // partition stats index is enabled together with column stats index
        spark.sql(s"set hoodie.metadata.index.column.stats.enable=true")
        spark.sql(s"set hoodie.metadata.record.index.enable=true")
        spark.sql(s"set hoodie.metadata.index.secondary.enable=true")
        spark.sql(s"set hoodie.metadata.index.column.stats.column.list=_hoodie_commit_time,_hoodie_partition_path,_hoodie_record_key,id,name,price,ts,country")
        // 2 filegroups per partition
        spark.sql(s"insert into $tableName values(1, 'a1', 100, 1000, 'SG'),(2, 'a2', 200, 1000, 'US')")
        spark.sql(s"insert into $tableName values(3, 'a3', 101, 1001, 'SG'),(4, 'a3', 201, 1001, 'US')")
        // create secondary index
        spark.sql(s"create index idx_price on $tableName (price)")
        // generate logs through updates
        spark.sql(s"update $tableName set price=price+1")
    }
} {code}
 

 

Add breakpoints to check the constraint type:

 

*Query 1: Comparison Operators*

 
{code:java}
getQueryRunner().execute(session, "SELECT * FROM " + table + " WHERE price = 101"); {code}
*Result 1:*

 
{code:java}
Constraint[summary={price:double:REGULAR=[ SortedRangeSet[type=double, ranges=1, {[101.0]}] ]}, expression=true::boolean] {code}
 

 

*Query 2: IN Lists*
{code:java}
getQueryRunner().execute(session, "SELECT * FROM " + table + " WHERE price IN (101, 101, 99)"); {code}
*Result 2:*
{code:java}
Constraint[summary={price:double:REGULAR=[ SortedRangeSet[type=double, ranges=2, {[99.0], [101.0]}] ]}, expression=true::boolean] {code}
*Query 3: BETWEEN Operator*
{code:java}
getQueryRunner().execute(session, "SELECT * FROM " + table + " WHERE price BETWEEN 100 AND 200"); {code}
*Result 3:*
{code:java}
Constraint[summary={price:double:REGULAR=[ SortedRangeSet[type=double, ranges=1, {[100.0,200.0]}] ]}, expression=true::boolean] {code}
 

*Query 4: JOIN Conditions*
{code:java}
getQueryRunner().execute(session, "WITH table_b_integers (int_col) AS ( VALUES (101), (250), (500) ) " +
        "SELECT * FROM " + table + " t JOIN table_b_integers b ON t.price = b.int_col"); {code}
*Result 4:*
{code:java}
Constraint[summary={price:double:REGULAR=[ SortedRangeSet[type=double, ranges=3, {[101.0], [250.0], [500.0]}] ]}, expression=true::boolean] {code}
 

 

As the test cases above show, implicit type coercion is applied in all 4 cases; hence, the {{toString}} path should not be triggered.
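For completeness, the preferred behavior from the issue description (allow the index lookup only when the lookup literal types match the indexed column's type, otherwise fall back to a full table scan) could be sketched as follows. {{canUseIndex}} and the surrounding class are hypothetical illustrations, not actual Hudi APIs:

```java
// Hypothetical sketch of the validation proposed in HUDI-9585; the names
// here are illustrative only and do not correspond to real Hudi APIs.
import java.util.List;

public class LookupTypeCheck {
    // Returns true only if every lookup literal is an instance of the
    // column's declared Java type; otherwise the caller should fall back
    // to a full table scan instead of using the secondary index.
    static boolean canUseIndex(Class<?> columnType, List<Object> lookupSet) {
        return lookupSet.stream().allMatch(columnType::isInstance);
    }

    public static void main(String[] args) {
        // Coerced literals (doubles against a double column): index usable.
        System.out.println(canUseIndex(Double.class, List.of(101.0, 99.0))); // true
        // Un-coerced string literals against a double column: fall back.
        System.out.println(canUseIndex(Double.class, List.of("101", "99"))); // false
    }
}
```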

> Validate lookup set types during index lookup [non spark query engine]
> ----------------------------------------------------------------------
>
>                 Key: HUDI-9585
>                 URL: https://issues.apache.org/jira/browse/HUDI-9585
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Davis Zhang
>            Assignee: Voon Hou
>            Priority: Critical
>             Fix For: 1.1.0
>
>
> context https://issues.apache.org/jira/browse/HUDI-9566
> *Description:*
> On the read path, ensure that the data type of lookup set literals is 
> compatible with the SI column’s declared type. Specifically:
>  * Allow index lookup only when types match
>  * If incompatible:
>  ** *Preferred behavior:* Fallback to full table scan (no index)
>  ** *Alternative behavior:* Throw query error
> *Label:* {{blocker}}, {{si}}, {{lookup-validation}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
