jtuglu1 commented on issue #5897:
URL: https://github.com/apache/druid/issues/5897#issuecomment-3862310445

   Here're my thoughts on this issue
   
   ## 1. Motivation
   
   The SQL planner chooses topN vs groupBy based on query shape (adding `LIMIT` 
and
   `ORDER BY metric` switches from groupBy to topN), and the two engines 
disagree on
   null/empty groups from `MV_FILTER_ONLY` / `MV_FILTER_REGEX`:
   
   ```sql
   -- groupBy engine: includes "no match" group
   SELECT MV_FILTER_ONLY(mv, ARRAY['c']), SUM(cnt) FROM tbl GROUP BY 1 ORDER BY 
1
   → [{null, 30}, {"c", 6}]
   
   -- topN engine: drops "no match" group
   SELECT MV_FILTER_ONLY(mv, ARRAY['c']), SUM(cnt) FROM tbl GROUP BY 1 ORDER BY 
2 DESC LIMIT 100
   → [{"c", 6}]
   ```
   
   **This proposal aligns all engines with topN behavior**: `MV_FILTER_ONLY` and
   `MV_FILTER_REGEX` exclude non-matching rows from the result entirely. groupBy
   must match topN by also dropping rows where no elements pass the filter.
   
   ---
   
   ## 2. Core Value Taxonomy
   
   | Value | Description |
   |---|---|
   | `null` | Absent dimension (SQL NULL) |
   | `""` | Single empty string value |
   | `[]` | Empty multi-value dimension |
   | `[null]` | MVD containing one null element |
   | `["a",""]` | MVD with values including empty string |
   
   ---
   
   ## 3. Equivalence Rules
   
   | Rule | Implication |
   |---|---|
   | `""` == `[]` | Empty string and empty MVD are interchangeable in grouping, 
filtering, and comparison |
   | `null` != `[]` | Absent dimension is distinct from an absent/null MVD |
   | `null` != `[null]` | Absent/null dimension is distinct from an MVD 
containing a null element |
   
   ---
   
   ## 4. Reference Dataset
   
   | multi_value | cnt |
   |---|---|
   | `null` | 1 |
   | `[]` | 5 |
   | `[null]` | 2 |
   | `""` | 12 |
   | `["a",""]` | 9 |
   | `["b","c"]` | 17 |
   | `["c","d"]` | 37 |
   | `["a","b","c","d",null,""]` | 3 |
   
   **Total:** 86
   
   ---
   
   ## 5. Summary of New Behaviors
   
   ### GROUP BY on MVD
   
   MVD explosion assigns each element to a group. The equivalence rules 
determine
   which group:
   
   | Group | cnt | Source |
   |---|---|---|
   | `null` | 6 | null(1) + [null](2) + null-in-mixed(3) |
   | `""` | 29 | [](5) + ""(12) + ""-in-["a",""](9) + ""-in-mixed(3) |
   | `"a"` | 12 | ["a",""](9) + mixed(3) |
   | `"b"` | 20 | ["b","c"](17) + mixed(3) |
   | `"c"` | 57 | ["b","c"](17) + ["c","d"](37) + mixed(3) |
   | `"d"` | 40 | ["c","d"](37) + mixed(3) |
   
   - `[]` explodes as `""` (since `""` == `[]`), NOT as `null`
   - `[null]` explodes as `null` (the element is null)
   - All engines (topN, groupBy) must return identical groups
   
   ### IS NULL / IS NOT NULL
   
   Since `""` == `[]`, and `""` IS NOT NULL, then `[]` IS NOT NULL.
   
   | Value (cnt) | IS NULL | Rationale |
   |---|---|---|
   | `null` (1) | T | Absent dimension |
   | `[]` (5) | **F** | `""` == `[]`, not null |
   | `[null]` (2) | T | Contains null element |
   | `""` (12) | F | Real value |
   | `["a",""]` (9) | F | All elements are non-null |
   | `["b","c"]` (17) | F | All elements are non-null |
   | `["c","d"]` (37) | F | All elements are non-null |
   | `["a","b","c","d",null,""]` (3) | T | Contains null element |
   
   **IS NULL total: 6. IS NOT NULL total: 80.**
   
   ### MV_FILTER_ONLY / MV_FILTER_REGEX
   
   These functions act as filter AND projection: rows where no elements match 
are
   excluded from the result entirely (not grouped into a null or `""` bucket).
   
   | Input state | Behavior |
   |---|---|
   | No elements match filter | **Row excluded from result** |
   | Elements match | Matched elements returned, row included |
   
   Example: `SELECT MV_FILTER_ONLY(multi_value, ARRAY['c']), SUM(cnt) FROM tbl 
GROUP BY 1`
   
   | Value (cnt) | Has match? | Result |
   |---|---|---|
   | `null` (1) | No | **excluded** |
   | `[]` (5) | No | **excluded** |
   | `[null]` (2) | No | **excluded** |
   | `""` (12) | No | **excluded** |
   | `["a",""]` (9) | No | **excluded** |
   | `["b","c"]` (17) | Yes | `"c"` |
   | `["c","d"]` (37) | Yes | `"c"` |
   | `["a","b","c","d",null,""]` (3) | Yes | `"c"` |
   
   Expected result (both topN and groupBy):
   
   | EXPR$0 | cnt |
   |---|---|
   | `"c"` | 57 |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to