JackieTien97 opened a new issue, #17799:
URL: https://github.com/apache/iotdb/issues/17799
### Motivation
Reshaping data between **wide** and **long** form is a very common need in
time-series analytics and reporting:
- **PIVOT (long → wide):** turn distinct values of a column (e.g. `region`,
`device_id`, a status code) into separate columns, aggregating a measurement
per cell. Great for building cross-tab/report-style result sets.
- **UNPIVOT (wide → long):** stack multiple measurement columns (e.g.
`temperature`, `humidity`, `pressure`) into a `(name, value)` pair. This is
especially natural for IoTDB, where the table model stores each measurement as
its own column and users frequently want a normalized long format for
downstream analytics or export.
Today the table model has no `PIVOT` / `UNPIVOT`, so users must hand-write
verbose `CASE WHEN ... END` aggregations (for pivoting) or long `UNION ALL`
chains (for unpivoting). This proposes first-class `PIVOT` / `UNPIVOT` support,
**using DuckDB's syntax as the reference**, since DuckDB offers both the
SQL-standard form and a friendly simplified form, and is the most ergonomic of
the mainstream engines.
> Examples below assume a table model table like `device_metrics(time,
device_id, region, temperature, humidity)`.
---
### Proposed Syntax (reference: DuckDB)
DuckDB implements **two** syntaxes for each statement: a simplified
(DuckDB-specific) form and the SQL-standard form. We can adopt one or both.
#### PIVOT
**Simplified form**
```sql
PIVOT ⟨table⟩
ON ⟨columns⟩ -- distinct values become new columns
USING ⟨aggregate(s)⟩ -- value of each cell
GROUP BY ⟨rows⟩ -- remaining row keys
[ORDER BY ...] [LIMIT ...];
```
IoTDB example — turn `region` values into columns of average temperature,
one row per device:
```sql
PIVOT device_metrics
ON region
USING avg(temperature)
GROUP BY device_id;
```
Restrict to specific values with `IN`, and use multiple/aliased aggregates:
```sql
PIVOT device_metrics
ON region IN ('north', 'south')
USING avg(temperature) AS avg_temp, max(temperature) AS max_temp
GROUP BY device_id;
-- => columns: device_id, north_avg_temp, north_max_temp, south_avg_temp,
south_max_temp
```
To reshape **without** aggregating, DuckDB uses `first()` (e.g. `USING
first(temperature)`).
**SQL-standard form**
```sql
SELECT * FROM device_metrics
PIVOT (
avg(temperature) AS avg_temp
FOR region IN ('north', 'south')
GROUP BY device_id
);
```
#### UNPIVOT
**Simplified form**
```sql
UNPIVOT ⟨table⟩
ON ⟨value-columns⟩
INTO NAME ⟨name-col⟩ VALUE ⟨value-col⟩;
```
IoTDB example — stack measurement columns into long `(measurement, value)`
rows:
```sql
UNPIVOT device_metrics
ON temperature, humidity
INTO NAME measurement VALUE value;
-- => columns: time, device_id, region, measurement, value
```
Dynamic column selection with `COLUMNS(* EXCLUDE (...))` (keeps working when
new measurements are added):
```sql
UNPIVOT device_metrics
ON COLUMNS(* EXCLUDE (time, device_id, region))
INTO NAME measurement VALUE value;
```
**SQL-standard form** (with optional `INCLUDE NULLS`; default drops rows
whose value is NULL):
```sql
FROM device_metrics
UNPIVOT [INCLUDE NULLS] (
value FOR measurement IN (temperature, humidity)
);
```
---
### Capabilities to cover (from DuckDB)
- **PIVOT**: `ON` (one or more columns), `USING` (one or more aggregates,
optional `AS` alias), `GROUP BY` rows; optional `IN (...)` list to fix the
pivoted values; generated column naming like `⟨value⟩_⟨agg-alias⟩`.
- **UNPIVOT**: `ON` explicit columns or `COLUMNS(* EXCLUDE (...))`, `INTO
NAME ... VALUE ...`; `INCLUDE NULLS`; (advanced) multiple value columns in one
statement; expressions/casts inside `ON` to reconcile differing column types.
- Usable as a top-level statement and inside subqueries / CTEs.
- (DuckDB also exposes `PIVOT_WIDER` / `PIVOT_LONGER` as aliases — optional.)
### Design decisions to discuss before implementing
1. **Which syntax first** — simplified, SQL-standard, or both. (Recommend
landing one end-to-end first, then the other.)
2. **Static vs. dynamic columns (biggest one).** `ON region` without an `IN`
list requires discovering distinct values at runtime, so the **output schema
isn't known at plan time**. Proposal: require an explicit `IN (...)` list in v1
(static schema), and treat auto-detection as a follow-up (it needs a
pre-execution scan of the source).
3. **Default grouping.** DuckDB defaults to `GROUP BY ALL`. In IoTDB,
defaulting to include the `time` column would explode cardinality — so we
should likely **require an explicit `GROUP BY`** (or define a sensible default
that excludes `time`).
4. **NULL handling for UNPIVOT.** Default drops NULL values; support
`INCLUDE NULLS`.
5. **Type unification for the UNPIVOT `value` column.** When source columns
differ in type, do we implicit-cast or require explicit casts (DuckDB requires
explicit casts)?
6. **Generated column naming** rules, and how non-identifier pivot values
(e.g. values with spaces) are quoted.
7. **Scope/phasing** of multi-column `ON`, multiple aggregates, and multiple
value columns.
### Prior Art
| Engine | PIVOT | UNPIVOT | Auto-detect (dynamic) columns |
Friendly/simplified syntax |
|---|:---:|:---:|:---:|:---:|
| DuckDB (reference) | ✅ | ✅ | ✅ | ✅ (`ON`/`USING`/`INTO`) |
| SQL Server | ✅ | ✅ | ❌ (must list values) | ❌ |
| Oracle | ✅ | ✅ | ❌ | ❌ |
| Snowflake | ✅ | ✅ | ✅ (`ANY` / subquery) | ❌ |
| Spark SQL | ✅ | ✅ | ❌ (must list values) | ❌ |
### Acceptance Criteria
- [ ] `PIVOT` with `ON` / `USING` / `GROUP BY` and an explicit `IN (...)`
list (static output schema).
- [ ] `UNPIVOT` with `INTO NAME ... VALUE ...` (simplified) and/or `FOR ...
IN (...)` (SQL-standard).
- [ ] `INCLUDE NULLS` option for `UNPIVOT` (default = drop NULLs).
- [ ] `PIVOT` / `UNPIVOT` usable inside subqueries and CTEs.
- [ ] Documented column-naming and value-column type rules.
- [ ] Tests + user documentation.
- [ ] (Phase 2) dynamic column auto-detection, multiple aggregates,
`COLUMNS(* EXCLUDE ...)`, multiple value columns.
### References
- DuckDB — PIVOT statement:
https://duckdb.org/docs/stable/sql/statements/pivot
- DuckDB — UNPIVOT statement:
https://duckdb.org/docs/stable/sql/statements/unpivot
- DuckDB — Friendly SQL (background):
https://duckdb.org/2023/08/23/even-friendlier-sql
---
> Note: this can be split into two independently claimable tasks — **PIVOT**
and **UNPIVOT** — if preferred.
>
> _Suggested labels: `enhancement`, table model / SQL._
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]