jtuglu1 commented on issue #5897:
URL: https://github.com/apache/druid/issues/5897#issuecomment-3862310445
Here're my thoughts on this issue
## 1. Motivation
The SQL planner chooses topN vs groupBy based on query shape (adding `LIMIT`
and
`ORDER BY metric` switches from groupBy to topN), and the two engines
disagree on
null/empty groups from `MV_FILTER_ONLY` / `MV_FILTER_REGEX`:
```sql
-- groupBy engine: includes "no match" group
SELECT MV_FILTER_ONLY(mv, ARRAY['c']), SUM(cnt) FROM tbl GROUP BY 1 ORDER BY
1
→ [{null, 30}, {"c", 6}]
-- topN engine: drops "no match" group
SELECT MV_FILTER_ONLY(mv, ARRAY['c']), SUM(cnt) FROM tbl GROUP BY 1 ORDER BY
2 DESC LIMIT 100
→ [{"c", 6}]
```
**This proposal aligns all engines with topN behavior**: `MV_FILTER_ONLY` and
`MV_FILTER_REGEX` exclude non-matching rows from the result entirely. groupBy
must match topN by also dropping rows where no elements pass the filter.
---
## 2. Core Value Taxonomy
| Value | Description |
|---|---|
| `null` | Absent dimension (SQL NULL) |
| `""` | Single empty string value |
| `[]` | Empty multi-value dimension |
| `[null]` | MVD containing one null element |
| `["a",""]` | MVD with values including empty string |
---
## 3. Equivalence Rules
| Rule | Implication |
|---|---|
| `""` == `[]` | Empty string and empty MVD are interchangeable in grouping,
filtering, and comparison |
| `null` != `[]` | Absent dimension is distinct from an absent/null MVD |
| `null` != `[null]` | Absent/null dimension is distinct from an MVD
containing a null element |
---
## 4. Reference Dataset
| multi_value | cnt |
|---|---|
| `null` | 1 |
| `[]` | 5 |
| `[null]` | 2 |
| `""` | 12 |
| `["a",""]` | 9 |
| `["b","c"]` | 17 |
| `["c","d"]` | 37 |
| `["a","b","c","d",null,""]` | 3 |
**Total:** 86
---
## 5. Summary of New Behaviors
### GROUP BY on MVD
MVD explosion assigns each element to a group. The equivalence rules
determine
which group:
| Group | cnt | Source |
|---|---|---|
| `null` | 6 | null(1) + [null](2) + null-in-mixed(3) |
| `""` | 29 | [](5) + ""(12) + ""-in-["a",""](9) + ""-in-mixed(3) |
| `"a"` | 12 | ["a",""](9) + mixed(3) |
| `"b"` | 20 | ["b","c"](17) + mixed(3) |
| `"c"` | 57 | ["b","c"](17) + ["c","d"](37) + mixed(3) |
| `"d"` | 40 | ["c","d"](37) + mixed(3) |
- `[]` explodes as `""` (since `""` == `[]`), NOT as `null`
- `[null]` explodes as `null` (the element is null)
- All engines (topN, groupBy) must return identical groups
### IS NULL / IS NOT NULL
Since `""` == `[]`, and `""` IS NOT NULL, then `[]` IS NOT NULL.
| Value (cnt) | IS NULL | Rationale |
|---|---|---|
| `null` (1) | T | Absent dimension |
| `[]` (5) | **F** | `""` == `[]`, not null |
| `[null]` (2) | T | Contains null element |
| `""` (12) | F | Real value |
| `["a",""]` (9) | F | All elements are non-null |
| `["b","c"]` (17) | F | All elements are non-null |
| `["c","d"]` (37) | F | All elements are non-null |
| `["a","b","c","d",null,""]` (3) | T | Contains null element |
**IS NULL total: 6. IS NOT NULL total: 80.**
### MV_FILTER_ONLY / MV_FILTER_REGEX
These functions act as filter AND projection: rows where no elements match
are
excluded from the result entirely (not grouped into a null or `""` bucket).
| Input state | Behavior |
|---|---|
| No elements match filter | **Row excluded from result** |
| Elements match | Matched elements returned, row included |
Example: `SELECT MV_FILTER_ONLY(multi_value, ARRAY['c']), SUM(cnt) FROM tbl
GROUP BY 1`
| Value (cnt) | Has match? | Result |
|---|---|---|
| `null` (1) | No | **excluded** |
| `[]` (5) | No | **excluded** |
| `[null]` (2) | No | **excluded** |
| `""` (12) | No | **excluded** |
| `["a",""]` (9) | No | **excluded** |
| `["b","c"]` (17) | Yes | `"c"` |
| `["c","d"]` (37) | Yes | `"c"` |
| `["a","b","c","d",null,""]` (3) | Yes | `"c"` |
Expected result (both topN and groupBy):
| EXPR$0 | cnt |
|---|---|
| `"c"` | 57 |
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]