DaZuiZui commented on issue #17799:
URL: https://github.com/apache/iotdb/issues/17799#issuecomment-4620232088
I would like to propose implementing #17799 in phases.
This first PR will address only the **UNPIVOT** part of #17799. **PIVOT**
and advanced UNPIVOT features will be handled in follow-up PRs **after the
Phase 1 PR has been reviewed and merged**.
The first PR should be marked as `Part of #17799`.
It should not use `Fixes #17799` or `Closes #17799`, because this PR will
not complete the whole issue.
## Phase 1: UNPIVOT v1
## Feature Definition
`UNPIVOT` converts multiple measurement columns in the IoTDB table model
from a **wide-table** format into a **long-table** format.
Before UNPIVOT:
```text
time, device_id, region, temperature, humidity
```
After UNPIVOT:
```text
time, device_id, region, measurement, value
```
Here:
- `measurement` stores the original source column name, such as
`temperature` or `humidity`.
- `value` stores the corresponding source column value.
This is useful for IoT scenarios where measurements are stored as separate
columns, but downstream analytics, export, or visualization prefers a
normalized `(measurement, value)` format.
## Syntax
`UNPIVOT v1` will support a subset of DuckDB simplified UNPIVOT syntax, with
`INCLUDE NULLS` added to satisfy the NULL handling requirement in #17799.
```sql
UNPIVOT [INCLUDE NULLS] source_relation
ON value_column [, value_column ...]
INTO NAME name_column VALUE value_column_alias;
```
Example:
```sql
UNPIVOT device_metrics
ON temperature, humidity
INTO NAME measurement VALUE value;
```
It can also be used in subqueries:
```sql
SELECT *
FROM (
UNPIVOT device_metrics
ON temperature, humidity
INTO NAME measurement VALUE value
);
```
And with CTEs:
```sql
WITH long_metrics AS (
UNPIVOT device_metrics
ON temperature, humidity
INTO NAME measurement VALUE value
)
SELECT * FROM long_metrics;
```
## Syntax Components
### `UNPIVOT`
Performs the wide-to-long transformation.
### `INCLUDE NULLS`
Optional.
By default, rows whose unpivoted value is `NULL` are dropped. If `INCLUDE
NULLS` is specified, those rows are kept.
### `source_relation`
The input relation.
In v1, I propose to support:
- normal table names
- CTE names
- parenthesized subqueries
For more complex inputs such as joins, users can wrap them in a subquery or
CTE first.
### `ON value_column [, ...]`
Specifies the source columns to unpivot.
In v1, only explicit column names are supported. The following are not
included in this phase:
- expressions
- `CAST`
- wildcards
- dynamic column selection
- `COLUMNS(* EXCLUDE (...))`
### `INTO NAME name_column`
Specifies the generated column that stores the original source column name.
The generated name column should use the existing text type in the IoTDB
table model, either `STRING` or `TEXT` depending on the current internal type
system.
### `VALUE value_column_alias`
Specifies the generated column that stores the original source column value.
Its type is the same as the strictly identical type of all columns listed in
the `ON` clause.
## Output Columns
The output columns are:
1. visible columns from `source_relation` that are not listed in `ON`
2. `name_column`
3. `value_column_alias`
For example, given source columns:
```text
time, device_id, region, temperature, humidity
```
The query:
```sql
UNPIVOT device_metrics
ON temperature, humidity
INTO NAME measurement VALUE value;
```
produces:
```text
time, device_id, region, measurement, value
```
`temperature` and `humidity` are no longer output as separate columns.
## NULL Semantics
By default, `UNPIVOT` drops `NULL` values.
```sql
UNPIVOT device_metrics
ON temperature, humidity
INTO NAME measurement VALUE value;
```
is semantically equivalent to:
```sql
SELECT time, device_id, region,
'temperature' AS measurement,
temperature AS value
FROM device_metrics
WHERE temperature IS NOT NULL
UNION ALL
SELECT time, device_id, region,
'humidity' AS measurement,
humidity AS value
FROM device_metrics
WHERE humidity IS NOT NULL;
```
With `INCLUDE NULLS`, the `IS NOT NULL` filters are not applied.
`UNPIVOT` should not guarantee output row order. Users should use `ORDER BY`
if they need stable ordering. Tests should also use explicit ordering.
## Type Rules
In v1, all columns listed in the `ON` clause must have exactly the same type.
Supported types should follow the existing IoTDB table model scalar types,
such as:
```text
BOOLEAN
INT32
INT64
FLOAT
DOUBLE
TEXT / STRING
TIMESTAMP
DATE
BLOB
```
There should be no automatic type coercion in v1. For example:
- `INT32 + INT64` should not be automatically coerced.
- `FLOAT + DOUBLE` should not be automatically coerced.
- `TEXT + STRING` should follow the existing IoTDB type system.
This strict type check should be performed in the `UNPIVOT` analyzer before
planning. We should not rely only on existing `UNION ALL` analysis, because set
operations may find a common super type, which would conflict with the strict
v1 semantics.
## Validation Rules
`UNPIVOT v1` should validate that:
- `ON` columns exist
- `ON` columns are not duplicated
- `ON` columns are visible columns
- `name_column` does not conflict with retained output columns
- `value_column_alias` does not conflict with retained output columns
- `name_column` and `value_column_alias` do not conflict with each other
- quoted identifier and case-insensitive conflicts follow existing IoTDB
identifier rules
## Implementation Plan
I propose not to rewrite `UNPIVOT` into `UNION ALL` in the pure parser or
pre-analyzer rewrite phase, because source visible columns, hidden columns,
types, and CTE/subquery schemas are only known after analysis.
A safer implementation flow is:
1. Add `UNPIVOT` as a `queryPrimary` / query body alternative in the
relational SQL grammar.
2. Parser builds an `Unpivot` AST node.
3. Analyzer analyzes `source_relation` first, then validates `ON` columns,
visibility, duplicates, output-name conflicts, and strict same-type rules.
4. Analyzer determines the static output schema for the `Unpivot` node.
5. Planner expands `Unpivot` into an equivalent `UNION ALL` plan, or reuses
existing set operation planning where possible.
Placing `UNPIVOT` at the `queryPrimary` / query body level makes it
naturally support top-level usage, subqueries, and CTEs, which matches the
issue requirement and DuckDB simplified syntax.
## Test Plan
The first phase should cover at least:
- top-level `UNPIVOT`
- CTE input
- subquery input
- default NULL dropping
- `INCLUDE NULLS`
- missing `ON` column
- duplicate `ON` column
- `name` / `value` output column conflict
- mismatched value column types
- `INT32 + INT64` not automatically coerced
- quoted identifier / case-insensitive conflicts
## Phased Plan
I plan to implement this issue in phases.
Phase 1 will focus only on `UNPIVOT v1`. It will support:
- explicit value columns
- static output schema
- default NULL dropping
- optional `INCLUDE NULLS`
- strict same-type validation
- usage as a top-level query, subquery, or CTE
After the Phase 1 PR is reviewed and merged, I will start Phase 2 in
follow-up PRs.
Phase 2 will cover `PIVOT` and advanced `UNPIVOT` features, such as:
- dynamic column selection
- `COLUMNS(* EXCLUDE (...))`
- expressions
- `CAST`
- automatic type coercion
- multiple `VALUE` columns
The reason for waiting until Phase 1 is merged before starting Phase 2 is
that `UNPIVOT v1` introduces the basic grammar, AST, analyzer, planner, and
test structure for this family of syntax. It is better to first get agreement
from reviewers on these foundations before adding more complex features on top
of them.
This split also keeps the first PR small and reviewable. `UNPIVOT v1` has a
static output schema, so it can be analyzed and planned safely. `PIVOT`,
especially dynamic `PIVOT` without an explicit `IN` list, may require runtime
discovery of distinct values, which makes output schema determination more
complex.
Doing `PIVOT` before the `UNPIVOT` foundation is reviewed could increase
implementation risk and make the PR harder to review.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]