Hi Xinyu,

Thanks for driving this PIP.

+1 for this!

Left some comments:

  1. Configuration option naming inconsistency

  The "Public Interfaces" section says:

  ▎ <column>.columnar-extend.enabled = true

  But the "Proposed Changes" section uses:

  ▎ <column>.map-storage-layout = 'extend'

  These are two different APIs for enabling the same feature. Which
one is it? The enum-based map-storage-layout is more extensible, but
the doc needs to be self-consistent. I'd recommend the enum approach
(map-storage-layout = extend) and dropping the .enabled flag.

  2. __field_mapping — strings or int IDs?

  The physical layout examples show [usage, load, iowait] (strings),
but the text says:

  ▎ "The field name <-> id dictionary is stored in file metadata;
__field_mapping holds only int ids"

  The examples should be corrected to show int IDs (e.g., [0, 1, 2])
to avoid confusion.

  Please also add one concrete end-to-end example: how a single row
is physically stored, and how the physical columns map back to the
logical fields.
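
  For instance, something along these lines would already help. This
is only a sketch of how I read the design, not what the PIP
specifies: K=3, the -1 sentinel, the `cols` field and the helper
names are all my assumptions.

```python
from __future__ import annotations

# Assumed to live once in file metadata: field name <-> int id.
FIELD_DICT = {"usage": 0, "load": 1, "iowait": 2, "steal": 3}
ID_TO_NAME = {v: k for k, v in FIELD_DICT.items()}

K = 3        # number of physical columns (tiny, for illustration)
EMPTY = -1   # hypothetical sentinel for "no field in this column"


def encode_row(logical: dict[str, float]) -> dict:
    """Put the first K fields into physical columns, spill the rest."""
    items = list(logical.items())
    mapping = [EMPTY] * K
    cols = [None] * K
    for i, (name, value) in enumerate(items[:K]):
        mapping[i] = FIELD_DICT[name]
        cols[i] = value
    overflow = {FIELD_DICT[name]: value for name, value in items[K:]}
    return {"__field_mapping": mapping, "cols": cols, "__overflow": overflow}


def decode_row(physical: dict) -> dict[str, float]:
    """Rebuild the logical MAP<STRING, T> value from the physical row."""
    out = {}
    for field_id, value in zip(physical["__field_mapping"], physical["cols"]):
        if field_id != EMPTY:
            out[ID_TO_NAME[field_id]] = value
    for field_id, value in physical["__overflow"].items():
        out[ID_TO_NAME[field_id]] = value
    return out
```

  Under these assumptions, {usage, load, iowait, steal} is stored with
__field_mapping = [0, 1, 2] and __overflow = {3: ...}, i.e. int ids in
both places.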

  3. Predicate pushdown correctness is under-specified

  The query path says:

  ▎ "From file metadata, look up the dictionary: usage → its physical
column set S (e.g., {col_0}). Translate the logical predicate usage >
30 into a physical sub-column predicate over S (col_0 > 30)."

  This is only correct if col_0 always holds "usage" within the entire
row group. But the design allows different rows to map different
fields to the same physical column. If col_0 holds "usage" in row 1
and "rss" in row 5, then a row-group-level min/max stat on col_0 is
meaningless — it mixes values from different logical fields.

  The doc needs to explicitly address:

  - Within a row group, is a physical column guaranteed to always map
to the same logical field? If yes, this is a hard constraint the
writer must enforce (which limits flexibility). If no, predicate
pushdown can only use the stats as a coarse pre-filter and must verify
via __field_mapping per row.
  - How does the "writer column layout optimization" ensure this
invariant? What happens when it can't (e.g., two rows in the same
group map different fields to col_0)?

  This is the most critical correctness concern in the design.
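
  To illustrate the non-guaranteed case, here is a minimal sketch in
which row-group stats act only as a coarse pre-filter and every
candidate row is re-checked against __field_mapping. The row
representation and function name are mine, and the sketch assumes the
queried field never lands in __overflow.

```python
def scan_row_group(rows, candidate_cols, field_id, threshold):
    """Return indices of rows where logical field `field_id` > threshold.

    rows: list of {"mapping": [...], "cols": [...]} (hypothetical shape)
    candidate_cols: physical column indices that may hold the field
    """
    # Coarse step: the max over the candidate columns mixes values of
    # every field stored there. It can prove "no row matches", but it
    # can never prove that a row does match.
    values = [
        row["cols"][c]
        for row in rows
        for c in candidate_cols
        if row["cols"][c] is not None
    ]
    if not values or max(values) <= threshold:
        return []

    # Exact step: a value only counts if the mapping says the column
    # holds the requested field in that particular row.
    hits = []
    for i, row in enumerate(rows):
        for c in candidate_cols:
            value = row["cols"][c]
            if row["mapping"][c] == field_id and value is not None and value > threshold:
                hits.append(i)
                break
    return hits
```

  With col_0 holding usage=10 in one row and rss=4096 in another, the
stats cannot skip the group for usage > 30, and only the per-row check
keeps the rss row out of the result. A plain col_0 > 30 would return
it.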

  4. Column assignment algorithm is missing

  ▎ "Writer performs column layout optimization while writing:
consecutive rows with similar field sets keep consistent physical
column positions, minimizing read amplification."

  This is a one-sentence hand-wave over what is arguably the most
important algorithmic component. The assignment strategy directly
impacts:

  - Whether predicate pushdown can work at all (see #3)
  - Read amplification (how many physical columns you need to read for
one logical field)
  - Overflow frequency

  I'd expect the design to specify at least the basic algorithm. For example:
  - Is it a greedy first-fit per row?
  - Is there a "dominant mapping" per row group established from the
first N rows?
  - What happens when a row's field set conflicts with the established mapping?
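
  As a strawman for discussion (not something the PIP describes): a
greedy first-fit that pins each field to one physical column for the
lifetime of a row group and sends conflicting fields to overflow. The
function name, the `slot_of` state and the -1 sentinel are
hypothetical.

```python
def assign_columns(row_fields, slot_of, k):
    """Assign one row's field ids to physical columns.

    row_fields: distinct field ids present in the row
    slot_of: field id -> column index pinned so far in this row group
             (mutated; reset it when a new row group starts)
    Returns (mapping of length k, list of field ids sent to overflow).
    """
    mapping = [-1] * k
    pending = []
    for f in row_fields:
        slot = slot_of.get(f)
        if slot is not None:
            mapping[slot] = f      # reuse the pinned column
        else:
            pending.append(f)

    # Columns not yet pinned to any field in this row group.
    used = set(slot_of.values())
    free = [c for c in range(k) if c not in used]

    overflow = []
    for f in pending:
        if free:
            c = free.pop(0)        # first fit
            slot_of[f] = c
            mapping[c] = f
        else:
            overflow.append(f)     # conflict: never share a pinned column
    return mapping, overflow
```

  This variant trades overflow for the "one field per column per row
group" invariant from #3. A looser variant that lets rows share
columns would overflow less but lose exact stats.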

  5. Overflow strategy and K sizing

  With K=16 (default) and the doc's own example of 5~50 fields per
row, rows at the upper end will have up to 34 fields in overflow. The
overflow uses MAP<INT, T>, which has the same "no columnar access"
problem the whole design is trying to solve.

  The doc says "persistent overflow drives K up in later files" but:
  - What's the adaptation formula? K = max(row_width) from the last
file? A percentile (p95, p99)?
  - Is there an upper bound on K? With 50,000 possible fields, K could
grow unbounded.
  - For the initial file, the user-configured K=16 may be a bad
default for "5~50 fields per row" scenarios.

  The doc should provide clearer guidance on K sizing and set
expectations about overflow rates.
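
  To make the question concrete, one possible rule is a clamped
percentile of the previous file's row widths. This is only an
illustration: the nearest-rank percentile, the bounds k_min/k_max and
both function names are my assumptions, not taken from the PIP.

```python
def next_k(row_widths, percentile=95, k_min=4, k_max=64):
    """Pick K for the next file from the previous file's row widths.

    Uses the nearest-rank percentile (integer percent) and clamps the
    result to [k_min, k_max] so that K cannot grow without bound.
    """
    if not row_widths:
        return k_min
    ordered = sorted(row_widths)
    # Integer ceiling of percentile * n / 100, at least rank 1.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return min(k_max, max(k_min, ordered[rank - 1]))


def overflow_rate(row_widths, k):
    """Fraction of all field values that would spill into overflow."""
    total = sum(row_widths)
    spilled = sum(max(0, w - k) for w in row_widths)
    return spilled / total if total else 0.0
```

  Reporting something like overflow_rate for the default K on a
representative dataset would help set expectations.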

  6. Relationship with existing VARIANT shredding infrastructure

  The Paimon codebase already has a mature VARIANT shredding
implementation (PaimonShreddingUtils, VariantSchema,
VariantShreddingWriter, inference infrastructure in
InferVariantShreddingSchema). There are clear architectural parallels:

  - Both decompose a semi-structured type into typed sub-columns
  - Both need a mapping/metadata layer to reconstruct the original value
  - Both integrate with Parquet/ORC reader/writer pipelines

  The PIP should discuss whether code can be shared or patterns
reused. For example, the inference mechanism
(InferVariantShreddingSchema) could inform the adaptive K algorithm.

  7. Memory/space overhead for sparse rows

  If K=16 but a row has only 2 fields, 14 physical columns are NULL.
Plus each row carries a LIST<INT> for __field_mapping. For rows with
small field counts, this struct-based layout may actually be worse
than the default MAP storage in terms of space.

  The doc should include a rough analysis of when columnar-extend
breaks even versus default MAP, so users can make informed decisions
about when to enable it.

  8. __field_mapping as LIST<INT> — length vs. K

  If __field_mapping has length K (one entry per physical column), a
missing field in position i could be represented as a sentinel (-1 or
similar). If it has variable length equal to the number of non-null
fields, you need a way to know which column each entry maps to.

  The doc's example shows [usage, load, --] where -- seems to mean "no
field", suggesting fixed-length K. This should be made explicit. A
fixed-length LIST where each position corresponds to a physical column
is simpler but less compressible; a variable-length list is more
compact but requires position metadata.
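
  The two options could look roughly like this. The bitmap as the
"position metadata" is just one possibility I picked for
illustration, and the helper names and the -1 sentinel are
hypothetical.

```python
def to_fixed(slots, k):
    """Fixed-length form: one entry per physical column, -1 if unused.

    slots: column index -> field id for the occupied columns of a row
    """
    return [slots.get(c, -1) for c in range(k)]


def to_compact(slots):
    """Variable-length form: occupancy bitmap plus ids of used columns."""
    bitmap = 0
    ids = []
    for c in sorted(slots):
        bitmap |= 1 << c
        ids.append(slots[c])
    return bitmap, ids


def from_compact(bitmap, ids, k):
    """Expand the compact form back into the fixed-length form."""
    it = iter(ids)
    return [next(it) if (bitmap >> c) & 1 else -1 for c in range(k)]
```

  Both forms carry the same information. The fixed one is simpler to
read positionally, while the compact one needs the extra bitmap (or an
equivalent) to recover positions.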

  9. Read path — missing detail on column pruning

  The query path says:

  ▎ "Issue one read with physical schema {__field_mapping} + S"

  But if __field_mapping is always required for every query (to
confirm which column holds which field), it becomes a mandatory read
cost. For queries that touch a single field, this is acceptable. For
queries that don't touch the MAP at all, does the reader skip
__field_mapping entirely? This should be explicit.

  ---
  Minor Issues

  - The __overflow type is shown as MAP<INT, T> (int keys) in the
struct definition but {steal: 0.3} (string keys) in the example.
Should be consistent.
  - The comparison table says "Predicate pushdown: All keys" for
columnar-extend, but as discussed in #3, this is only true under
specific column assignment constraints that aren't guaranteed.
  - The opt-in option name <column>.columnar-extend.enabled uses a
per-column prefix, which differs from the naming of typical Paimon
table options. It is worth aligning with existing conventions.

  ---

Best,
Jingsong

On Wed, Jun 3, 2026 at 4:29 PM 刘欣瑀 <[email protected]> wrote:
>
> Hi everyone,
>
> I'd like to start a discussion on a storage optimization for `MAP<STRING, T>` 
> columns targeting time-series, IoT, observability, and similar scenarios.
>
> ### Problem
>
> In these workloads, data is **"globally heterogeneous but locally 
> homogeneous"** — the global key union across all rows can reach tens of 
> thousands, but each row only carries 5~50 keys that are highly repetitive 
> within groups (e.g., the same reporter always reports `{usage, load, 
> iowait}`).
>
> Current options all fall short:
>
> - **Default MAP storage** (KV arrays): no per-key predicate pushdown, no 
> per-key column pruning, no per-key statistics.
>
> - **VARIANT**: unshredded fields (>90% in these scenarios) fall into a binary 
> blob, losing all columnar advantages.
>
> - **Wide table**: flattening 50,000+ fields into columns results in >99% 
> NULL, with metadata explosion and unbounded schema churn.
>
> ### Proposed Solution: Columnar-Extend
>
> We propose an **opt-in storage optimization** for MAP columns — enabled via a 
> table option:
>
> ```sql
> CREATE TABLE metrics (
>   ts TIMESTAMP,
>   metric STRING,
>   ext-map MAP<STRING, DOUBLE>
> ) WITH (
>   'ext-map.map-storage-layout' = 'extend',
>   'ext-map.columnar-extend.num-columns' = '16'
> );
> ```
>
> The key idea: instead of storing MAP entries as KV arrays, physically rewrite 
> them into a **Struct with `K` typed reusable columns** plus a lightweight 
> `__field_mapping`. This gives every key full columnar treatment — predicate 
> pushdown, column pruning, native statistics — while keeping the column count 
> bounded at `K` (tens, not tens of thousands). Rows exceeding `K` keys spill 
> into a small overflow map, so correctness never depends on `K` being large 
> enough. `K` adapts across files based on the actual data width.
>
> The logical type stays `MAP<STRING, T>` — the optimization is transparent to 
> users. Existing queries like `ext-map['usage'] > 30` work unchanged; the 
> engine translates them into physical sub-column predicates internally.
>
> ### PIP Document
>
> The full proposal — including physical layout, query path, public interface 
> changes, and rejected alternatives — is available here: 
> https://cwiki.apache.org/confluence/display/PAIMON/PIP-43%3A+Columnar-Extend+Storage+Optimization+for+MAP+Type+in+Paimon
>
> ### Looking for Feedback
>
> I'd appreciate community feedback on:
>
> 1. The overall approach — e.g., column count exceeding K with `__overflow` 
> vs. other strategies.
>
> 2. The configuration design (`map-storage-layout` enum, `num-columns`).
>
> 3. Any concerns about compatibility.
>
> 4. Additional use cases — beyond time-series/IoT/observability, are there 
> other scenarios in your workloads where MAP columns have high-cardinality, 
> locally-repetitive keys that would benefit from this optimization?
>
> Looking forward to the discussion!
>
> Best regards,
>
> Xinyu
