wuguowei1994 opened a new pull request, #63118:
URL: https://github.com/apache/doris/pull/63118

   ## Summary
   
   On the current master branch, inverted index predicate pushdown does not 
work correctly when querying `VARIANT` fields with explicit `CAST`.
   
   This is a serious issue because it is not limited to a single target type. 
In our testing, predicates in the form of `CAST(variant_field["key"] AS <type>) 
= ...` fail to leverage the inverted index properly across casted `VARIANT` 
access patterns.
   
   This is especially problematic because the recommended usage for `VARIANT` 
fields is to explicitly use `CAST` when extracting typed values. In our 
internal production workloads, `VARIANT` is heavily used, and all business 
teams are required to query `VARIANT` subfields through explicit `CAST`. As a 
result, these queries cannot benefit from inverted index filtering and end up 
scanning significantly more rows than expected, causing severe performance 
degradation.
   
   For production workloads with large `VARIANT` columns, this effectively 
makes the inverted index unusable for the officially recommended query pattern, 
which has a major impact on query latency and resource consumption.
   
   ---
   
   ## Reproduction
   
   ```sql
   DROP TABLE IF EXISTS variant_inverted_intkey_test;
   
   CREATE TABLE variant_inverted_intkey_test (
       row_id BIGINT,
       v VARIANT,
       INDEX idx_v(v) USING INVERTED
   )
   ENGINE=OLAP
   DUPLICATE KEY(row_id)
   DISTRIBUTED BY HASH(row_id) BUCKETS 1
   PROPERTIES (
       "replication_num" = "1",
       "disable_auto_compaction" = "true",
       "inverted_index_storage_format" = "v2"
   );
   
   INSERT INTO variant_inverted_intkey_test VALUES
   (1,  '{"int_key": 1}'),
   (2,  '{"int_key": 2}'),
   (3,  '{"int_key": 3}'),
   (4,  '{"int_key": 4}'),
   (5,  '{"int_key": 5}'),
   (6,  '{"int_key": 6}'),
   (7,  '{"int_key": 7}'),
   (8,  '{"int_key": 8}'),
   (9,  '{"int_key": 9}'),
   (10, '{"int_key": 10}'),
   (11, '{"int_key": 11}'),
   (12, '{"int_key": 12}'),
   (13, '{"int_key": 13}'),
   (14, '{"int_key": 14}'),
   (15, '{"int_key": 15}'),
   (16, '{"int_key": 16}'),
   (17, '{"int_key": 17}'),
   (18, '{"int_key": 18}'),
   (19, '{"int_key": 19}'),
   (20, '{"int_key": 20}');
   
   SELECT row_id, CAST(v["int_key"] AS INT) AS int_key
   FROM variant_inverted_intkey_test
   WHERE CAST(v["int_key"] AS INT) = 13;
   ```
   
   ---
   
   ## Expected Behavior
   
   The predicate:
   
   ```sql
   CAST(v["int_key"] AS INT) = 13
   ```
   
   should be pushed down to the inverted index on the `VARIANT` column, and the 
query should use the inverted index to filter rows before data scanning.
   
   Only the matching row should need to be read after index filtering.
   
   ---
   
   ## Actual Behavior
   
   The query result is correct, but the query profile shows that the inverted 
index does not effectively filter the data.
   
   Instead of being pruned by the inverted index, all 20 rows are still 
read/scanned. This indicates that the predicate involving `CAST` on the 
`VARIANT` subfield is not correctly handled by inverted index predicate 
pushdown.
   
   Please check the query profile after running the reproduction SQL. The key 
point is that the inverted index does not successfully reduce the scanned rows 
for the casted `VARIANT` predicate.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to