JackieTien97 opened a new issue, #17799:
URL: https://github.com/apache/iotdb/issues/17799

   ### Motivation
   
   Reshaping data between **wide** and **long** form is a very common need in 
time-series analytics and reporting:
   
   - **PIVOT (long → wide):** turn distinct values of a column (e.g. `region`, 
`device_id`, a status code) into separate columns, aggregating a measurement 
per cell. Great for building cross-tab/report-style result sets.
   - **UNPIVOT (wide → long):** stack multiple measurement columns (e.g. 
`temperature`, `humidity`, `pressure`) into a `(name, value)` pair. This is 
especially natural for IoTDB, where the table model stores each measurement as 
its own column and users frequently want a normalized long format for 
downstream analytics or export.
   
   Today the table model has no `PIVOT` / `UNPIVOT`, so users must hand-write 
verbose `CASE WHEN ... END` aggregations (for pivoting) or long `UNION ALL` 
chains (for unpivoting). This proposes first-class `PIVOT` / `UNPIVOT` support, 
**using DuckDB's syntax as the reference**, since DuckDB offers both the 
SQL-standard form and a friendly simplified form, and is the most ergonomic of 
the mainstream engines.
   
   > Examples below assume a table model table like `device_metrics(time, 
device_id, region, temperature, humidity)`.
   
   ---
   
   ### Proposed Syntax (reference: DuckDB)
   
   DuckDB implements **two** syntaxes for each statement: a simplified 
(DuckDB-specific) form and the SQL-standard form. We can adopt one or both.
   
   #### PIVOT
   
   **Simplified form**
   
   ```sql
   PIVOT ⟨table⟩
   ON ⟨columns⟩                 -- distinct values become new columns
   USING ⟨aggregate(s)⟩         -- value of each cell
   GROUP BY ⟨rows⟩              -- remaining row keys
   [ORDER BY ...] [LIMIT ...];
   ```
   
   IoTDB example — turn `region` values into columns of average temperature, 
one row per device:
   
   ```sql
   PIVOT device_metrics
   ON region
   USING avg(temperature)
   GROUP BY device_id;
   ```
   
   Restrict to specific values with `IN`, and use multiple/aliased aggregates:
   
   ```sql
   PIVOT device_metrics
   ON region IN ('north', 'south')
   USING avg(temperature) AS avg_temp, max(temperature) AS max_temp
   GROUP BY device_id;
   -- => columns: device_id, north_avg_temp, north_max_temp, south_avg_temp, 
south_max_temp
   ```
   
   To reshape **without** aggregating, DuckDB uses `first()` (e.g. `USING 
first(temperature)`).
   
   **SQL-standard form**
   
   ```sql
   SELECT * FROM device_metrics
   PIVOT (
     avg(temperature) AS avg_temp
     FOR region IN ('north', 'south')
     GROUP BY device_id
   );
   ```
   
   #### UNPIVOT
   
   **Simplified form**
   
   ```sql
   UNPIVOT ⟨table⟩
   ON ⟨value-columns⟩
   INTO NAME ⟨name-col⟩ VALUE ⟨value-col⟩;
   ```
   
   IoTDB example — stack measurement columns into long `(measurement, value)` 
rows:
   
   ```sql
   UNPIVOT device_metrics
   ON temperature, humidity
   INTO NAME measurement VALUE value;
   -- => columns: time, device_id, region, measurement, value
   ```
   
   Dynamic column selection with `COLUMNS(* EXCLUDE (...))` (keeps working when 
new measurements are added):
   
   ```sql
   UNPIVOT device_metrics
   ON COLUMNS(* EXCLUDE (time, device_id, region))
   INTO NAME measurement VALUE value;
   ```
   
   **SQL-standard form** (with optional `INCLUDE NULLS`; default drops rows 
whose value is NULL):
   
   ```sql
   FROM device_metrics
   UNPIVOT [INCLUDE NULLS] (
     value FOR measurement IN (temperature, humidity)
   );
   ```
   
   ---
   
   ### Capabilities to cover (from DuckDB)
   
   - **PIVOT**: `ON` (one or more columns), `USING` (one or more aggregates, 
optional `AS` alias), `GROUP BY` rows; optional `IN (...)` list to fix the 
pivoted values; generated column naming like `⟨value⟩_⟨agg-alias⟩`.
   - **UNPIVOT**: `ON` explicit columns or `COLUMNS(* EXCLUDE (...))`, `INTO 
NAME ... VALUE ...`; `INCLUDE NULLS`; (advanced) multiple value columns in one 
statement; expressions/casts inside `ON` to reconcile differing column types.
   - Usable as a top-level statement and inside subqueries / CTEs.
   - (DuckDB also exposes `PIVOT_WIDER` / `PIVOT_LONGER` as aliases — optional.)
   
   ### Design decisions to discuss before implementing
   
   1. **Which syntax first** — simplified, SQL-standard, or both. (Recommend 
landing one end-to-end first, then the other.)
   2. **Static vs. dynamic columns (biggest one).** `ON region` without an `IN` 
list requires discovering distinct values at runtime, so the **output schema 
isn't known at plan time**. Proposal: require an explicit `IN (...)` list in v1 
(static schema), and treat auto-detection as a follow-up (it needs a 
pre-execution scan of the source).
   3. **Default grouping.** DuckDB defaults to `GROUP BY ALL`. In IoTDB, 
defaulting to include the `time` column would explode cardinality — so we 
should likely **require an explicit `GROUP BY`** (or define a sensible default 
that excludes `time`).
   4. **NULL handling for UNPIVOT.** Default drops NULL values; support 
`INCLUDE NULLS`.
   5. **Type unification for the UNPIVOT `value` column.** When source columns 
differ in type, do we implicit-cast or require explicit casts (DuckDB requires 
explicit casts)?
   6. **Generated column naming** rules, and how non-identifier pivot values 
(e.g. values with spaces) are quoted.
   7. **Scope/phasing** of multi-column `ON`, multiple aggregates, and multiple 
value columns.
   
   ### Prior Art
   
   | Engine | PIVOT | UNPIVOT | Auto-detect (dynamic) columns | 
Friendly/simplified syntax |
   |---|:---:|:---:|:---:|:---:|
   | DuckDB (reference) | ✅ | ✅ | ✅ | ✅ (`ON`/`USING`/`INTO`) |
   | SQL Server | ✅ | ✅ | ❌ (must list values) | ❌ |
   | Oracle | ✅ | ✅ | ❌ | ❌ |
   | Snowflake | ✅ | ✅ | ✅ (`ANY` / subquery) | ❌ |
   | Spark SQL | ✅ | ✅ | ❌ (must list values) | ❌ |
   
   ### Acceptance Criteria
   
   - [ ] `PIVOT` with `ON` / `USING` / `GROUP BY` and an explicit `IN (...)` 
list (static output schema).
   - [ ] `UNPIVOT` with `INTO NAME ... VALUE ...` (simplified) and/or `FOR ... 
IN (...)` (SQL-standard).
   - [ ] `INCLUDE NULLS` option for `UNPIVOT` (default = drop NULLs).
   - [ ] `PIVOT` / `UNPIVOT` usable inside subqueries and CTEs.
   - [ ] Documented column-naming and value-column type rules.
   - [ ] Tests + user documentation.
   - [ ] (Phase 2) dynamic column auto-detection, multiple aggregates, 
`COLUMNS(* EXCLUDE ...)`, multiple value columns.
   
   ### References
   
   - DuckDB — PIVOT statement: 
https://duckdb.org/docs/stable/sql/statements/pivot
   - DuckDB — UNPIVOT statement: 
https://duckdb.org/docs/stable/sql/statements/unpivot
   - DuckDB — Friendly SQL (background): 
https://duckdb.org/2023/08/23/even-friendlier-sql
   
   ---
   
   > Note: this can be split into two independently claimable tasks — **PIVOT** 
and **UNPIVOT** — if preferred.
   >
   > _Suggested labels: `enhancement`, table model / SQL._


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to