DaZuiZui commented on issue #17799:
URL: https://github.com/apache/iotdb/issues/17799#issuecomment-4620232088

   I would like to propose implementing #17799 in phases.
   
   This first PR will address only the **UNPIVOT** part of #17799. **PIVOT** 
and advanced UNPIVOT features will be handled in follow-up PRs **after the 
Phase 1 PR has been reviewed and merged**.
   
   The first PR should be marked as `Part of #17799`.
   
   It should not use `Fixes #17799` or `Closes #17799`, because this PR will 
not complete the whole issue.
   
   ## Phase 1: UNPIVOT v1
   
   ## Feature Definition
   
   `UNPIVOT` converts multiple measurement columns in the IoTDB table model 
from a **wide-table** format into a **long-table** format.
   
   Before UNPIVOT:
   
   ```text
   time, device_id, region, temperature, humidity
   ```
   
   After UNPIVOT:
   
   ```text
   time, device_id, region, measurement, value
   ```
   
   Here:
   
   - `measurement` stores the original source column name, such as 
`temperature` or `humidity`.
   - `value` stores the corresponding source column value.
   
   This is useful for IoT scenarios where measurements are stored as separate 
columns, but downstream analytics, export, or visualization prefers a 
normalized `(measurement, value)` format.
   
   ## Syntax
   
   `UNPIVOT v1` will support a subset of DuckDB simplified UNPIVOT syntax, with 
`INCLUDE NULLS` added to satisfy the NULL handling requirement in #17799.
   
   ```sql
   UNPIVOT [INCLUDE NULLS] source_relation
   ON value_column [, value_column ...]
   INTO NAME name_column VALUE value_column_alias;
   ```
   
   Example:
   
   ```sql
   UNPIVOT device_metrics
   ON temperature, humidity
   INTO NAME measurement VALUE value;
   ```
   
   It can also be used in subqueries:
   
   ```sql
   SELECT *
   FROM (
     UNPIVOT device_metrics
     ON temperature, humidity
     INTO NAME measurement VALUE value
   );
   ```
   
   And with CTEs:
   
   ```sql
   WITH long_metrics AS (
     UNPIVOT device_metrics
     ON temperature, humidity
     INTO NAME measurement VALUE value
   )
   SELECT * FROM long_metrics;
   ```
   
   ## Syntax Components
   
   ### `UNPIVOT`
   
   Performs the wide-to-long transformation.
   
   ### `INCLUDE NULLS`
   
   Optional.
   
   By default, rows whose unpivoted value is `NULL` are dropped. If `INCLUDE 
NULLS` is specified, those rows are kept.
   
   ### `source_relation`
   
   The input relation.
   
   In v1, I propose to support:
   
   - normal table names
   - CTE names
   - parenthesized subqueries
   
   For more complex inputs such as joins, users can wrap them in a subquery or 
CTE first.
   
   ### `ON value_column [, ...]`
   
   Specifies the source columns to unpivot.
   
   In v1, only explicit column names are supported. The following are not 
included in this phase:
   
   - expressions
   - `CAST`
   - wildcards
   - dynamic column selection
   - `COLUMNS(* EXCLUDE (...))`
   
   ### `INTO NAME name_column`
   
   Specifies the generated column that stores the original source column name.
   
   The generated name column should use the existing text type in the IoTDB 
table model, either `STRING` or `TEXT` depending on the current internal type 
system.
   
   ### `VALUE value_column_alias`
   
   Specifies the generated column that stores the original source column value.
   
   Its type is the same as the strictly identical type of all columns listed in 
the `ON` clause.
   
   ## Output Columns
   
   The output columns are:
   
   1. visible columns from `source_relation` that are not listed in `ON`
   2. `name_column`
   3. `value_column_alias`
   
   For example, given source columns:
   
   ```text
   time, device_id, region, temperature, humidity
   ```
   
   The query:
   
   ```sql
   UNPIVOT device_metrics
   ON temperature, humidity
   INTO NAME measurement VALUE value;
   ```
   
   produces:
   
   ```text
   time, device_id, region, measurement, value
   ```
   
   `temperature` and `humidity` are no longer output as separate columns.
   
   ## NULL Semantics
   
   By default, `UNPIVOT` drops `NULL` values.
   
   ```sql
   UNPIVOT device_metrics
   ON temperature, humidity
   INTO NAME measurement VALUE value;
   ```
   
   is semantically equivalent to:
   
   ```sql
   SELECT time, device_id, region,
          'temperature' AS measurement,
          temperature AS value
   FROM device_metrics
   WHERE temperature IS NOT NULL
   
   UNION ALL
   
   SELECT time, device_id, region,
          'humidity' AS measurement,
          humidity AS value
   FROM device_metrics
   WHERE humidity IS NOT NULL;
   ```
   
   With `INCLUDE NULLS`, the `IS NOT NULL` filters are not applied.
   
   `UNPIVOT` should not guarantee output row order. Users should use `ORDER BY` 
if they need stable ordering. Tests should also use explicit ordering.
   
   ## Type Rules
   
   In v1, all columns listed in the `ON` clause must have exactly the same type.
   
   Supported types should follow the existing IoTDB table model scalar types, 
such as:
   
   ```text
   BOOLEAN
   INT32
   INT64
   FLOAT
   DOUBLE
   TEXT / STRING
   TIMESTAMP
   DATE
   BLOB
   ```
   
   There should be no automatic type coercion in v1. For example:
   
   - `INT32 + INT64` should not be automatically coerced.
   - `FLOAT + DOUBLE` should not be automatically coerced.
   - `TEXT + STRING` should follow the existing IoTDB type system.
   
   This strict type check should be performed in the `UNPIVOT` analyzer before 
planning. We should not rely only on existing `UNION ALL` analysis, because set 
operations may find a common super type, which would conflict with the strict 
v1 semantics.
   
   ## Validation Rules
   
   `UNPIVOT v1` should validate that:
   
   - `ON` columns exist
   - `ON` columns are not duplicated
   - `ON` columns are visible columns
   - `name_column` does not conflict with retained output columns
   - `value_column_alias` does not conflict with retained output columns
   - `name_column` and `value_column_alias` do not conflict with each other
   - quoted identifier and case-insensitive conflicts follow existing IoTDB 
identifier rules
   
   ## Implementation Plan
   
   I propose not to rewrite `UNPIVOT` into `UNION ALL` in the pure parser or 
pre-analyzer rewrite phase, because source visible columns, hidden columns, 
types, and CTE/subquery schemas are only known after analysis.
   
   A safer implementation flow is:
   
   1. Add `UNPIVOT` as a `queryPrimary` / query body alternative in the 
relational SQL grammar.
   2. Parser builds an `Unpivot` AST node.
   3. Analyzer analyzes `source_relation` first, then validates `ON` columns, 
visibility, duplicates, output-name conflicts, and strict same-type rules.
   4. Analyzer determines the static output schema for the `Unpivot` node.
   5. Planner expands `Unpivot` into an equivalent `UNION ALL` plan, or reuses 
existing set operation planning where possible.
   
   Placing `UNPIVOT` at the `queryPrimary` / query body level makes it 
naturally support top-level usage, subqueries, and CTEs, which matches the 
issue requirement and DuckDB simplified syntax.
   
   ## Test Plan
   
   The first phase should cover at least:
   
   - top-level `UNPIVOT`
   - CTE input
   - subquery input
   - default NULL dropping
   - `INCLUDE NULLS`
   - missing `ON` column
   - duplicate `ON` column
   - `name` / `value` output column conflict
   - mismatched value column types
   - `INT32 + INT64` not automatically coerced
   - quoted identifier / case-insensitive conflicts
   
   ## Phased Plan
   
   I plan to implement this issue in phases.
   
   Phase 1 will focus only on `UNPIVOT v1`. It will support:
   
   - explicit value columns
   - static output schema
   - default NULL dropping
   - optional `INCLUDE NULLS`
   - strict same-type validation
   - usage as a top-level query, subquery, or CTE
   
   After the Phase 1 PR is reviewed and merged, I will start Phase 2 in 
follow-up PRs.
   
   Phase 2 will cover `PIVOT` and advanced `UNPIVOT` features, such as:
   
   - dynamic column selection
   - `COLUMNS(* EXCLUDE (...))`
   - expressions
   - `CAST`
   - automatic type coercion
   - multiple `VALUE` columns
   
   The reason for waiting until Phase 1 is merged before starting Phase 2 is 
that `UNPIVOT v1` introduces the basic grammar, AST, analyzer, planner, and 
test structure for this family of syntax. It is better to first get agreement 
from reviewers on these foundations before adding more complex features on top 
of them.
   
   This split also keeps the first PR small and reviewable. `UNPIVOT v1` has a 
static output schema, so it can be analyzed and planned safely. `PIVOT`, 
especially dynamic `PIVOT` without an explicit `IN` list, may require runtime 
discovery of distinct values, which makes output schema determination more 
complex.
   
   Doing `PIVOT` before the `UNPIVOT` foundation is reviewed could increase 
implementation risk and make the PR harder to review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to