Hi Xinyu,
Thanks for driving this PIP.
+1 for this!
Left some comments:
1. Configuration option naming inconsistency
The "Public Interfaces" section says:
▎ <column>.columnar-extend.enabled = true
But the "Proposed Changes" section uses:
▎ <column>.map-storage-layout = 'extend'
These are two different APIs for enabling the same feature. Which
one is it? The enum-based map-storage-layout is more extensible, but
the doc needs to be self-consistent. I'd recommend the enum approach
(map-storage-layout = extend) and dropping the .enabled flag.
2. __field_mapping — strings or int IDs?
The physical layout examples show [usage, load, iowait] (strings),
but the text says:
▎ "The field name <-> id dictionary is stored in file metadata;
__field_mapping holds only int ids"
The examples should be corrected to show int IDs (e.g., [0, 1, 2])
to avoid confusion.
It would also help to walk through one concrete example end to end:
how a row is physically stored, and how the stored int ids map back to
the logical field names.
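To make the request concrete, here is a minimal Python sketch of the kind of example I have in mind. It is only my reading of the doc: the dictionary contents, the -1 sentinel, and the decode_row helper are placeholders I made up, not something the PIP defines.

```python
# Hypothetical file-level dictionary (field name <-> int id), kept in file metadata.
FIELD_DICT = {"usage": 0, "load": 1, "iowait": 2, "rss": 3}
ID_TO_NAME = {fid: name for name, fid in FIELD_DICT.items()}

EMPTY = -1  # assumed sentinel: this physical column is unused in this row


def decode_row(field_mapping, cols):
    """Rebuild the logical MAP<STRING, DOUBLE> from one physical row.

    field_mapping[i] is the int id of the field stored in physical column i.
    """
    return {
        ID_TO_NAME[fid]: value
        for fid, value in zip(field_mapping, cols)
        if fid != EMPTY
    }


# Physical row: __field_mapping = [0, 1, 2], col_0..col_2 hold the values.
print(decode_row([0, 1, 2], [71.5, 0.8, 3.2]))
```

With this shape, the layout example would show [0, 1, 2] in __field_mapping and the strings only in the file metadata.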
3. Predicate pushdown correctness is under-specified
The query path says:
▎ "From file metadata, look up the dictionary: usage → its physical
column set S (e.g., {col_0}). Translate the logical predicate usage >
30 into a physical sub-column predicate over S (col_0 > 30)."
This is only correct if col_0 always holds "usage" within the entire
row group. But the design allows different rows to map different
fields to the same physical column. If col_0 holds "usage" in row 1
and "rss" in row 5, then a row-group-level min/max stat on col_0 is
meaningless — it mixes values from different logical fields.
The doc needs to explicitly address:
- Within a row group, is a physical column guaranteed to always map
to the same logical field? If yes, this is a hard constraint the
writer must enforce (which limits flexibility). If no, predicate
pushdown can only use the stats as a coarse pre-filter and must verify
via __field_mapping per row.
- How does the "writer column layout optimization" ensure this
invariant? What happens when it can't (e.g., two rows in the same
group map different fields to col_0)?
This is the most critical correctness concern in the design.
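A small Python sketch of the failure mode, assuming col_0 is allowed to hold different fields in different rows. The field ids and the row values are invented for illustration; nothing here is taken from the PIP.

```python
# Each entry: (field id that col_0 holds in this row, value stored in col_0).
USAGE, RSS = 0, 3  # hypothetical field ids

row_group = [
    (USAGE, 12.0),
    (USAGE, 25.0),
    (RSS, 4096.0),  # same physical column, different logical field
]


def col_stats(rows):
    """Row-group min/max over the raw physical column."""
    values = [v for _, v in rows]
    return min(values), max(values)


def may_match_gt(rows, threshold):
    """Pruning on raw col_0 stats: can only ever answer 'maybe'."""
    _, hi = col_stats(rows)
    return hi > threshold


def exact_match_gt(rows, field_id, threshold):
    """Per-row verification through __field_mapping."""
    return [v for fid, v in rows if fid == field_id and v > threshold]


# Predicate: usage > 30
print(may_match_gt(row_group, 30))           # stats say the group may match
print(exact_match_gt(row_group, USAGE, 30))  # no usage value actually matches
```

Here the rss value inflates the max, so the stats cannot prune the group, and a plain col_0 > 30 filter without the per-row check would return the rss row as if it were usage.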
4. Column assignment algorithm is missing
▎ "Writer performs column layout optimization while writing:
consecutive rows with similar field sets keep consistent physical
column positions, minimizing read amplification."
This is a one-sentence hand-wave over what is arguably the most
important algorithmic component. The assignment strategy directly
impacts:
- Whether predicate pushdown can work at all (see #3)
- Read amplification (how many physical columns you need to read for
one logical field)
- Overflow frequency
I'd expect the design to specify at least the basic algorithm. For example:
- Is it a greedy first-fit per row?
- Is there a "dominant mapping" per row group established from the
first N rows?
- What happens when a row's field set conflicts with the established mapping?
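As one possible baseline for the discussion, here is a sticky first-fit sketch in Python. This is my own guess at a simple strategy, not the algorithm the PIP intends; assign_columns and its behavior on conflicts are assumptions.

```python
def assign_columns(rows, k):
    """Assign each row's fields to k physical columns (sticky first-fit).

    A field prefers the column it first received in this row group. If that
    column is taken, or the field is new, it takes the lowest free column.
    Fields that find no free column go to overflow.
    """
    home = {}  # field -> preferred physical column
    out = []
    for fields in rows:
        slots = [None] * k
        overflow = []
        pending = []
        # Pass 1: fields whose preferred column is still free in this row.
        for f in fields:
            c = home.get(f)
            if c is not None and slots[c] is None:
                slots[c] = f
            else:
                pending.append(f)
        # Pass 2: first-fit for everything else.
        for f in pending:
            free = next((i for i, s in enumerate(slots) if s is None), None)
            if free is None:
                overflow.append(f)
            else:
                slots[free] = f
                home.setdefault(f, free)
        out.append((slots, overflow))
    return out


rows = [["usage", "load", "iowait"], ["usage", "load", "iowait"], ["rss", "usage"]]
for slots, overflow in assign_columns(rows, 3):
    print(slots, overflow)
```

Even this simple strategy puts load in col_1 for the first two rows and rss in col_1 for the third, which is exactly the mixed-column case from #3. The doc should say whether the writer tolerates that, cuts a new row group, or sends rss to overflow.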
5. Overflow strategy and K sizing
With K=16 (default) and the doc's own example of 5~50 fields per
row, every row wider than 16 fields spills, and a 50-field row puts 34
fields in overflow. The overflow uses
MAP<INT, T> — which has the same "no columnar access" problem the
whole design is trying to solve.
The doc says "persistent overflow drives K up in later files" but:
- What's the adaptation formula? K = max(row_width) from the last
file? A percentile (p95, p99)?
- Is there an upper bound on K? With 50,000 possible fields, K could
grow unbounded.
- For the initial file, the user-configured K=16 may be a bad
default for "5~50 fields per row" scenarios.
The doc should provide clearer guidance on K sizing and set
expectations about overflow rates.
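For example, a percentile-based rule with a hard cap could look like the Python sketch below. The percentile, the bounds, and both function names are hypothetical; I am only illustrating the kind of formula the doc should pin down.

```python
def next_k(row_widths, percentile=95, k_min=16, k_max=64):
    """Pick K for the next file from the row widths seen in the last one.

    Nearest-rank percentile, clamped to [k_min, k_max]. All defaults are
    placeholders for discussion.
    """
    if not row_widths:
        return k_min
    ordered = sorted(row_widths)
    # Integer ceiling of percentile * n / 100, at least rank 1.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return min(k_max, max(k_min, ordered[rank - 1]))


def overflow_rate(row_widths, k):
    """Fraction of rows that spill at least one field into __overflow."""
    return sum(w > k for w in row_widths) / len(row_widths)


# Invented width distribution: 50 narrow rows, 40 medium, 10 wide.
widths = [5] * 50 + [20] * 40 + [50] * 10
print(overflow_rate(widths, 16), next_k(widths))
```

With this invented distribution, half of the rows overflow at K=16, which is the kind of number the doc should state for its target workloads.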
6. Relationship with existing VARIANT shredding infrastructure
The Paimon codebase already has a mature VARIANT shredding
implementation (PaimonShreddingUtils, VariantSchema,
VariantShreddingWriter, inference infrastructure in
InferVariantShreddingSchema). There are clear architectural parallels:
- Both decompose a semi-structured type into typed sub-columns
- Both need a mapping/metadata layer to reconstruct the original value
- Both integrate with Parquet/ORC reader/writer pipelines
The PIP should discuss whether code can be shared or patterns
reused. For example, the inference mechanism
(InferVariantShreddingSchema) could inform the adaptive K algorithm.
7. Memory/space overhead for sparse rows
If K=16 but a row has only 2 fields, 14 physical columns are NULL.
Plus each row carries a LIST<INT> for __field_mapping. For rows with
small field counts, this struct-based layout may actually be worse
than the default MAP storage in terms of space.
The doc should include a rough analysis of when columnar-extend
breaks even versus default MAP, so users can make informed decisions
about when to enable it.
8. __field_mapping as LIST<INT> — length vs. K
If __field_mapping has length K (one entry per physical column), a
missing field in position i could be represented as a sentinel (-1 or
similar). If it has variable length equal to the number of non-null
fields, you need a way to know which column each entry maps to.
The doc's example shows [usage, load, --] where -- seems to mean "no
field", suggesting fixed-length K. This should be made explicit. A
fixed-length LIST where each position corresponds to a physical column
is simpler but less compressible; a variable-length list is more
compact but requires position metadata.
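The two options side by side, as a Python sketch. The -1 sentinel and the parallel-list shape of the variable-length form are my assumptions; the doc may intend something different.

```python
EMPTY = -1  # assumed sentinel for "no field in this physical column"


def encode_fixed(placed, k):
    """Fixed length K: entry i is the field id held by physical column i."""
    return [placed.get(i, EMPTY) for i in range(k)]


def encode_variable(placed):
    """Variable length: only occupied columns, as (positions, field ids)."""
    positions = sorted(placed)
    return positions, [placed[p] for p in positions]


# placed maps physical column index -> field id; columns 1 and 3 are unused.
placed = {0: 0, 2: 7}
print(encode_fixed(placed, 4))
print(encode_variable(placed))
```

The fixed form needs no extra position list, while the variable form has to carry one as soon as the occupied columns are not a contiguous prefix.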
9. Read path — missing detail on column pruning
The query path says:
▎ "Issue one read with physical schema {__field_mapping} + S"
But if __field_mapping is always required for every query (to
confirm which column holds which field), it becomes a mandatory read
cost. For queries that touch a single field, this is acceptable. For
queries that don't touch the MAP at all, does the reader skip
__field_mapping entirely? This should be explicit.
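What I would expect, sketched in Python. The function name, the dotted sub-column naming, and the key_to_cols lookup are assumptions on my side, and overflow handling is deliberately left out.

```python
def physical_read_schema(projected, map_col, requested_keys, key_to_cols):
    """Physical columns to read for one query.

    projected:      logical top-level columns the query references
    map_col:        name of the columnar-extend MAP column
    requested_keys: keys accessed as map_col['key']
    key_to_cols:    per-file lookup, key -> set of physical sub-columns (S)
    """
    schema = [c for c in projected if c != map_col]
    if map_col not in projected:
        # The MAP is not touched, so __field_mapping is skipped as well.
        return schema
    sub = {"__field_mapping"}
    for key in requested_keys:
        sub |= key_to_cols.get(key, set())
    return schema + [f"{map_col}.{s}" for s in sorted(sub)]


lookup = {"usage": {"col_0"}, "load": {"col_1"}}
print(physical_read_schema(["ts", "ext_map"], "ext_map", ["usage"], lookup))
print(physical_read_schema(["ts", "metric"], "ext_map", [], lookup))
```

The second call is the case I am asking about: a query that never references the MAP should read neither the sub-columns nor __field_mapping.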
---
Minor Issues
- The __overflow type is shown as MAP<INT, T> (int keys) in the
struct definition but {steal: 0.3} (string keys) in the example.
Should be consistent.
- The comparison table says "Predicate pushdown: All keys" for
columnar-extend, but as discussed in #3, this is only true under
specific column assignment constraints that aren't guaranteed.
- The opt-in option name <column>.columnar-extend.enabled uses a
different property prefix convention than the typical Paimon
table options; worth aligning with existing conventions.
---
Best,
Jingsong
On Wed, Jun 3, 2026 at 4:29 PM 刘欣瑀 <[email protected]> wrote:
>
> Hi everyone,
>
> I'd like to start a discussion on a storage optimization for `MAP<STRING, T>`
> columns targeting time-series, IoT, observability, and similar scenarios.
>
> ### Problem
>
> In these workloads, data is **"globally heterogeneous but locally
> homogeneous"** — the global key union across all rows can reach tens of
> thousands, but each row only carries 5~50 keys that are highly repetitive
> within groups (e.g., the same reporter always reports `{usage, load,
> iowait}`).
>
> Current options all fall short:
>
> - **Default MAP storage** (KV arrays): no per-key predicate pushdown, no
> per-key column pruning, no per-key statistics.
>
> - **VARIANT**: unshredded fields (>90% in these scenarios) fall into a binary
> blob, losing all columnar advantages.
>
> - **Wide table**: flattening 50,000+ fields into columns results in >99%
> NULL, with metadata explosion and unbounded schema churn.
>
> ### Proposed Solution: Columnar-Extend
>
> We propose an **opt-in storage optimization** for MAP columns — enabled via a
> table option:
>
> ```sql
> CREATE TABLE metrics (
>   ts TIMESTAMP,
>   metric STRING,
>   ext-map MAP<STRING, DOUBLE>
> ) WITH (
>   'ext-map.map-storage-layout' = 'extend',
>   'ext-map.columnar-extend.num-columns' = '16'
> );
> ```
>
> The key idea: instead of storing MAP entries as KV arrays, physically rewrite
> them into a **Struct with `K` typed reusable columns** plus a lightweight
> `__field_mapping`. This gives every key full columnar treatment — predicate
> pushdown, column pruning, native statistics — while keeping the column count
> bounded at `K` (tens, not tens of thousands). Rows exceeding `K` keys spill
> into a small overflow map, so correctness never depends on `K` being large
> enough. `K` adapts across files based on the actual data width.
>
> The logical type stays `MAP<STRING, T>` — the optimization is transparent to
> users. Existing queries like `ext-map['usage'] > 30` work unchanged; the
> engine translates them into physical sub-column predicates internally.
>
> ### PIP Document
>
> The full proposal — including physical layout, query path, public interface
> changes, and rejected alternatives — is available here:
> https://cwiki.apache.org/confluence/display/PAIMON/PIP-43%3A+Columnar-Extend+Storage+Optimization+for+MAP+Type+in+Paimon
>
> ### Looking for Feedback
>
> I'd appreciate community feedback on:
>
> 1. The overall approach — e.g., column count exceeding K with `__overflow`
> vs. other strategies.
>
> 2. The configuration design (`map-storage-layout` enum, `num-columns`).
>
> 3. Any concerns about compatibility.
>
> 4. Additional use cases — beyond time-series/IoT/observability, are there
> other scenarios in your workloads where MAP columns have high-cardinality,
> locally-repetitive keys that would benefit from this optimization?
>
> Looking forward to the discussion!
>
> Best regards,
>
> Xinyu