*Hi Apache IoTDB community,*
*issue: *https://github.com/apache/iotdb/issues/17799
I would like to propose implementing* #17799 *in phases.
This first PR will address only the *UNPIVOT part of #17799*. *PIVOT and
advanced UNPIVOT* features will be handled in follow-up PRs *after the
Phase 1 PR has been reviewed and merged*.
The first *PR* should be marked as: *Part of #17799*
It should not use Fixes #17799 or Closes #17799, *because this PR will not
complete the whole issue.*
Feature Definition
UNPIVOT converts multiple measurement columns in the IoTDB table model from
a wide-table format into a long-table format.
Before UNPIVOT:
time, device_id, region, temperature, humidity
After UNPIVOT:
time, device_id, region, measurement, value
Here:
- measurement stores the original source column name, such as temperature
or humidity.
- value stores the corresponding source column value.
This is useful for IoT scenarios where measurements are stored as separate
columns, but downstream analytics, export, or visualization prefers a
normalized (measurement, value) format.
Syntax
UNPIVOT v1 will support a subset of DuckDB simplified UNPIVOT syntax,
with INCLUDE
NULLS added to satisfy the NULL handling requirement in #17799.
UNPIVOT [INCLUDE NULLS] source_relation ON value_column [, value_column ...]
INTO NAME name_column VALUE value_column_alias;
Example:
UNPIVOT device_metrics ON temperature, humidity INTO NAME measurement VALUE
value;
It can also be used in subqueries:
SELECT * FROM ( UNPIVOT device_metrics ON temperature, humidity INTO NAME
measurement VALUE value );
And with CTEs:
WITH long_metrics AS ( UNPIVOT device_metrics ON temperature, humidity INTO
NAME measurement VALUE value ) SELECT * FROM long_metrics;
Syntax ComponentsUNPIVOT
Performs the wide-to-long transformation.
INCLUDE NULLS
Optional.
By default, rows whose unpivoted value is NULL are dropped. If INCLUDE NULLS is
specified, those rows are kept.
source_relation
The input relation.
In v1, I propose to support:
- normal table names
- CTE names
- parenthesized subqueries
For more complex inputs such as joins, users can wrap them in a subquery or
CTE first.
ON value_column [, ...]
Specifies the source columns to unpivot.
In v1, only explicit column names are supported. The following are not
included in this phase:
- expressions
- CAST
- wildcards
- dynamic column selection
- COLUMNS(* EXCLUDE (...))
INTO NAME name_column
Specifies the generated column that stores the original source column name.
The generated name column should use the existing text type in the IoTDB
table model, either STRING or TEXT depending on the current internal type
system.
VALUE value_column_alias
Specifies the generated column that stores the original source column value.
Its type is the same as the strictly identical type of all columns listed
in the ON clause.
Output Columns
The output columns are:
1. visible columns from source_relation that are not listed in ON
2. name_column
3. value_column_alias
For example, given source columns:
time, device_id, region, temperature, humidity
The query:
UNPIVOT device_metrics ON temperature, humidity INTO NAME measurement VALUE
value;
Produces:
time, device_id, region, measurement, value
temperature and humidity are no longer output as separate columns.
NULL Semantics
By default, UNPIVOT drops NULL values.
UNPIVOT device_metrics ON temperature, humidity INTO NAME measurement VALUE
value;
Is semantically equivalent to:
SELECT time, device_id, region, 'temperature' AS measurement, temperature AS
value FROM device_metrics WHERE temperature IS NOT NULL UNION ALL SELECT
time, device_id, region, 'humidity' AS measurement, humidity AS value FROM
device_metrics WHERE humidity IS NOT NULL;
With INCLUDE NULLS, the IS NOT NULL filters are not applied.
UNPIVOT should not guarantee output row order. Users should use ORDER BY if
they need stable ordering. Tests should also use explicit ordering.
Type Rules
In v1, all columns listed in the ON clause must have exactly the same type.
Supported types should follow the existing IoTDB table model scalar types,
such as:
BOOLEAN INT32 INT64 FLOAT DOUBLE TEXT / STRING TIMESTAMP DATE BLOB
There should be no automatic type coercion in v1. For example:
- INT32 + INT64 should not be automatically coerced.
- FLOAT + DOUBLE should not be automatically coerced.
- TEXT + STRING should follow the existing IoTDB type system.
This strict type check should be performed in the UNPIVOT analyzer before
planning. We should not rely only on existing UNION ALL analysis, because
set operations may find a common super type, which would conflict with the
strict v1 semantics.
Validation Rules
UNPIVOT v1 should validate that:
- ON columns exist
- ON columns are not duplicated
- ON columns are visible columns
- name_column does not conflict with retained output columns
- value_column_alias does not conflict with retained output columns
- name_column and value_column_alias do not conflict with each other
- quoted identifier and case-insensitive conflicts follow existing IoTDB
identifier rules
Implementation Plan
I propose not to rewrite UNPIVOT into UNION ALL in the pure parser or
pre-analyzer rewrite phase, because source visible columns, hidden columns,
types, and CTE/subquery schemas are only known after analysis.
A safer implementation flow is:
1. Add UNPIVOT as a queryPrimary / query body alternative in the
relational SQL grammar.
2. Parser builds an Unpivot AST node.
3. Analyzer analyzes source_relation first, then validates ON columns,
visibility, duplicates, output-name conflicts, and strict same-type rules.
4. Analyzer determines the static output schema for the Unpivot node.
5. Planner expands Unpivot into an equivalent UNION ALL plan, or reuses
existing set operation planning where possible.
Placing UNPIVOT at the queryPrimary / query body level makes it naturally
support top-level usage, subqueries, and CTEs, which matches the issue
requirement and DuckDB simplified syntax.
Test Plan
The first phase should cover at least:
- top-level UNPIVOT
- CTE input
- subquery input
- default NULL dropping
- INCLUDE NULLS
- missing ON column
- duplicate ON column
- name / value output column conflict
- mismatched value column types
- INT32 + INT64 not automatically coerced
- quoted identifier / case-insensitive conflicts
Phased Plan
I plan to implement this issue in phases.
Phase 1 will focus only on UNPIVOT v1. It will support:
- explicit value columns
- static output schema
- default NULL dropping
- optional INCLUDE NULLS
- strict same-type validation
- usage as a top-level query, subquery, or CTE
After the Phase 1 PR is reviewed and merged, I will start Phase 2 in
follow-up PRs.
Phase 2 will cover PIVOT and advanced UNPIVOT features, such as:
- dynamic column selection
- COLUMNS(* EXCLUDE (...))
- expressions
- CAST
- automatic type coercion
- multiple VALUE columns
*The reason for waiting until Phase 1 is merged before starting Phase 2 is
that UNPIVOT v1 introduces the basic grammar, AST, analyzer, planner, and
test structure for this family of syntax. It is better to first get
agreement from reviewers on these foundations before adding more complex
features on top of them.*
This split also keeps the first PR small and reviewable. UNPIVOT v1 has a
static output schema, so it can be analyzed and planned safely. PIVOT,
especially dynamic PIVOT without an explicit IN list, may require runtime
discovery of distinct values, which makes output schema determination more
complex.
Doing PIVOT before the UNPIVOT foundation is reviewed could increase
implementation risk and make the PR harder to review.
Best regards,
Bryan Yang(杨易达)