Re: [DISCUSS] PIP-43 Columnar-Extend Storage Optimization for MAP Type

Jingsong Li Thu, 04 Jun 2026 01:20:44 -0700

Moving this to the correct thread. Forwarded from Xinyu:

Hi, Jingsong,

Thank you for the thorough and insightful review — these comments were
extremely helpful and meaningfully sharpened the design. I've updated
the PIP to address every point. A summary per item:

1. **Config naming**. Agreed — dropped the .enabled flag entirely and
standardized on the enum <column>.map-storage-layout = extend
throughout. The only additional knob is
<column>.columnar-extend.max-columns (the K_max cap).

2. **__field_mapping — int ids**. All examples now use int ids ([0, 1,
2], with -1 for an empty slot). I added a **worked example** showing
the file-level name ↔ id dictionary and a step-by-step reconstruction
of each row back into MAP<STRING, DOUBLE>, including an overflow row.
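
A minimal Python sketch of that reconstruction, for illustration only.
It is not code from the PIP: the function name, the sample dictionary,
and the assumption that the overflow is keyed by int ids are
hypothetical.

```python
from typing import Dict, List, Optional

EMPTY = -1  # sentinel in __field_mapping for an unused physical column


def reconstruct_row(
    id_to_name: Dict[int, str],        # file-level id -> name dictionary
    field_mapping: List[int],          # one int id (or -1) per physical column
    columns: List[Optional[float]],    # this row's physical column values
    overflow: Dict[int, float],        # assumed: overflow keyed by int id
) -> Dict[str, float]:
    """Rebuild the logical MAP<STRING, DOUBLE> for a single row."""
    result: Dict[str, float] = {}
    for position, field_id in enumerate(field_mapping):
        if field_id == EMPTY:
            continue  # empty slot: nothing stored in this column for the row
        result[id_to_name[field_id]] = columns[position]
    for field_id, value in overflow.items():
        result[id_to_name[field_id]] = value
    return result


# Hypothetical dictionary and rows (K = 3).
id_to_name = {0: "usage", 1: "load", 2: "iowait", 3: "steal"}
full_row = reconstruct_row(id_to_name, [0, 1, 2], [42.0, 1.5, 0.2], {})
sparse_row = reconstruct_row(id_to_name, [0, 1, -1], [35.0, 0.9, None], {})
overflow_row = reconstruct_row(id_to_name, [0, 1, 2], [50.0, 2.0, 0.1], {3: 0.3})
```

The third row carries four fields with only three physical columns, so
the fourth ("steal") comes back from the overflow.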

3. **Predicate pushdown correctness**. This was the most important
point. The design **intentionally imposes no hard invariant** that a
physical column always holds the same logical field within a row group
— different rows may map different fields to the same column. So
pushdown is treated strictly as a **coarse pre-filter**: column
statistics can only skip blocks that provably cannot match, and a
column that happens to mix fields is merely less selective (we read a
few extra rows). The exact answer is always produced by
reconstructing each row via __field_mapping after reading.

What makes this effective in practice is the **natural locality of the
target workloads**: rows of the same metric are written in long
contiguous runs sharing one field pattern, and the allocator pins that
pattern to the same columns for the run. Since pushdown stats are
fine-grained (page-level in Parquet, row-group/stripe-level in ORC),
the vast majority of stat blocks hold a single logical field per
physical column, so their min/max stay tight and page / row-group
pruning keeps a high filtering ratio.
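
The two-stage behavior can be sketched in Python as follows. This is
an illustration under simplifying assumptions, not the PIP's code: a
single physical column (K = 1), no overflow, hypothetical field ids
and helper names, and a `usage > 30` predicate.

```python
from typing import List, Optional, Tuple

# One row: (__field_mapping, physical column values).
Row = Tuple[List[int], List[Optional[float]]]


def column_min_max(block: List[Row], position: int) -> Optional[Tuple[float, float]]:
    """Min/max of one physical column, regardless of which field owns each value."""
    values = [vals[position] for _, vals in block if vals[position] is not None]
    return (min(values), max(values)) if values else None


def may_match_greater_than(stats: Optional[Tuple[float, float]], threshold: float) -> bool:
    """Coarse pre-filter: False only when no value in the block can exceed threshold."""
    return stats is not None and stats[1] > threshold


def exact_greater_than(block: List[Row], field_id: int, threshold: float) -> List[float]:
    """Exact answer: use each row's __field_mapping to find the field's column."""
    hits = []
    for mapping, vals in block:
        if field_id in mapping:
            value = vals[mapping.index(field_id)]
            if value is not None and value > threshold:
                hits.append(value)
    return hits


def scan(block: List[Row], field_id: int, threshold: float) -> Optional[List[float]]:
    """Return None if the block is pruned, else the exact matches."""
    if not may_match_greater_than(column_min_max(block, 0), threshold):
        return None
    return exact_greater_than(block, field_id, threshold)


USAGE, RSS = 0, 1  # hypothetical field ids
tight_block = [([USAGE], [10.0]), ([USAGE], [20.0])]  # single field, max 20
mixed_block = [([USAGE], [10.0]), ([RSS], [5000.0])]  # column mixes two fields
hit_block = [([USAGE], [10.0]), ([USAGE], [45.0])]

pruned = scan(tight_block, USAGE, 30.0)       # stats alone rule the block out
false_read = scan(mixed_block, USAGE, 30.0)   # read, but no row matches
matches = scan(hit_block, USAGE, 30.0)
```

The mixed block is read unnecessarily because the "rss" value inflates
the column's max, but the per-row check still returns the correct
(empty) result, which is the "less selective, never wrong" property.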

4. **Column assignment algorithm**. Added the streaming per-row
allocator (ExtendColumnAllocator), which maintains an in-memory column
→ owning field state across rows: Hit (a reused field keeps its
column), Evict (a new field takes a free column, else the LRU column
is evicted), Retain (untouched columns keep their owner, so stable
groups stay stable), Overflow (extras beyond K spill). To your
specific questions: it is not a per-row first-fit, and there is no
frozen "dominant mapping" — the state evolves continuously; when a
row's field set conflicts the allocator simply re-pins, and
correctness never depends on conflict-freedom (it's backed by
__field_mapping).
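
A rough Python sketch of such a streaming allocator, written only to
illustrate the Hit / Evict / Retain / Overflow cases. It is not the
actual ExtendColumnAllocator: the class name, the LRU bookkeeping by
row number, and the tie-breaking are assumptions, and each row's field
ids are assumed to be distinct.

```python
from typing import List, Tuple

EMPTY = -1


class ExtendColumnAllocatorSketch:
    def __init__(self, k: int) -> None:
        self.k = k
        self.owner: List[int] = [EMPTY] * k   # column -> owning field id
        self.last_used: List[int] = [0] * k   # column -> row number of last use
        self.row_no = 0

    def assign(self, field_ids: List[int]) -> Tuple[List[int], List[int]]:
        """Return (this row's __field_mapping, field ids that overflow)."""
        self.row_no += 1
        mapping = [EMPTY] * self.k
        pending: List[int] = []
        # Hit: a field that already owns a column keeps it.
        for fid in field_ids:
            if fid in self.owner:
                col = self.owner.index(fid)
                mapping[col] = fid
                self.last_used[col] = self.row_no
            else:
                pending.append(fid)
        overflow: List[int] = []
        for fid in pending:
            # Only columns not already used by this row are candidates.
            candidates = [c for c in range(self.k) if mapping[c] == EMPTY]
            if not candidates:
                overflow.append(fid)  # Overflow: more fields than columns
                continue
            free = [c for c in candidates if self.owner[c] == EMPTY]
            # Evict: prefer a free column, else the least recently used one.
            col = free[0] if free else min(candidates, key=lambda c: self.last_used[c])
            self.owner[col] = fid
            self.last_used[col] = self.row_no
            mapping[col] = fid
        # Retain: columns this row did not touch keep their owner in self.owner.
        return mapping, overflow


alloc = ExtendColumnAllocatorSketch(k=3)
row1 = alloc.assign([0, 1, 2])      # three free columns taken
row2 = alloc.assign([0, 1])         # hits; column 2 stays owned by field 2
owner_after_row2 = list(alloc.owner)
row3 = alloc.assign([0, 1, 3])      # field 3 evicts the LRU column (column 2)
row4 = alloc.assign([0, 1, 3, 4])   # all columns used by hits; field 4 overflows
```

Because every row still records its own mapping, a re-pin such as the
one in the third row changes selectivity only, not correctness.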

5. **K sizing & overflow.** Specified explicitly. K_next =
min(P99_row_width(recent files), K_max), adapting in both directions
across files; K_max (default 256) bounds growth so a key explosion
can't create unbounded columns. There is no fixed default like 16 —
the first file simply starts at K_max (no prior files to adapt from),
and an over-wide first file only affects that one file before
adaptation converges. So overflow only catches long-tail rows in
steady state.
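
The sizing rule can be illustrated with a short Python sketch. The
nearest-rank percentile definition and the function names are
assumptions for illustration; the PIP may define P99 and "recent
files" differently.

```python
from typing import List

K_MAX_DEFAULT = 256


def p99_row_width(row_widths: List[int]) -> int:
    """Nearest-rank 99th percentile of per-row field counts."""
    ordered = sorted(row_widths)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def next_k(recent_row_widths: List[int], k_max: int = K_MAX_DEFAULT) -> int:
    """K_next = min(P99_row_width(recent files), K_max)."""
    if not recent_row_widths:
        return k_max  # first file: no prior files to adapt from
    return min(p99_row_width(recent_row_widths), k_max)


first_file_k = next_k([])                      # starts at K_max
steady_k = next_k([8] * 990 + [40] * 10)       # long tail of 40-field rows
capped_k = next_k([300] * 100)                 # key explosion, bounded by K_max
```

In the steady-state example only the 1% of rows with 40 fields exceed
K = 8, so only those rows spill into the overflow.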

6. **VARIANT shredding.** Added a dedicated section. The inference /
reconstruct / plumbing layers (VariantShreddingWriter's ShreddedResult
builder, PaimonShreddedRow + RowToColumnConverter, and ShreddingUtils
/ VariantUtils) are good candidates to refactor and reuse. The genuine
difference is the **write-path column layout** — shredding binds one
fixed column per field plus a blob, whereas extend reuses K columns
per-row with a typed overflow. The PIP proposes generalizing the
shared layers rather than building a parallel stack.

7. **Sparse-row space overhead.** Added a break-even analysis:
columnar-extend pays off when rows are grouped and locally homogeneous
and each row's field count is a meaningful fraction of K; default MAP
can be preferable for very small, highly heterogeneous rows. The PIP
gives concrete guidance so users can decide when to enable it.

8. **__field_mapping length**. Made explicit: it is **fixed length
K**, one entry per physical column, with sentinel -1 for an empty
column (examples updated accordingly). Fixed length keeps position →
column deterministic with no extra position metadata, and it still
compresses well since same-group rows share an identical mapping under
RLE.
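
A conceptual Python illustration of why identical fixed-length
mappings compress well under run-length encoding. This treats each
row's mapping as one symbol; real Parquet/ORC encoders work on the
flattened list values with their own RLE and dictionary schemes, so
this is only an approximation of the effect.

```python
from itertools import groupby
from typing import List, Tuple

Mapping = Tuple[int, ...]


def run_length_encode(mappings: List[Mapping]) -> List[Tuple[Mapping, int]]:
    """Collapse consecutive identical mappings into (mapping, run length) pairs."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(mappings)]


# 1,500 rows written in two contiguous groups (K = 3, -1 = empty column).
mappings = [(0, 1, 2)] * 1000 + [(0, 1, -1)] * 500
runs = run_length_encode(mappings)
```

Here 1,500 per-row mappings collapse into two runs, one per group.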

9. **Read-path pruning**. Made explicit: if a query does not reference
the MAP column, the whole struct — including __field_mapping — is
never projected and never read (struct-level column pruning).
__field_mapping is a mandatory read **only** for queries that actually
access the MAP.
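
A small Python sketch of that projection rule. The function name, the
dotted sub-column paths, and the `ext_map` / `col_0` names are
hypothetical, and the overflow is left out for brevity.

```python
from typing import Dict, List, Set


def physical_read_schema(
    referenced_columns: List[str],          # top-level columns the query uses
    map_column: str,                        # the extend-layout MAP column
    accessed_fields: Set[str],              # MAP keys the query touches
    field_to_columns: Dict[str, Set[str]],  # from the file-level dictionary
) -> List[str]:
    """List the physical columns to read for one query."""
    schema = [c for c in referenced_columns if c != map_column]
    if map_column not in referenced_columns:
        return schema  # whole struct pruned, including __field_mapping
    schema.append(f"{map_column}.__field_mapping")
    sub_columns: Set[str] = set()
    for field in accessed_fields:
        sub_columns |= field_to_columns.get(field, set())
    schema.extend(f"{map_column}.{c}" for c in sorted(sub_columns))
    return schema


field_to_columns = {"usage": {"col_0"}, "load": {"col_1"}}
no_map_query = physical_read_schema(["ts", "metric"], "ext_map", set(), field_to_columns)
usage_query = physical_read_schema(["ts", "ext_map"], "ext_map", {"usage"}, field_to_columns)
```

The first query never touches the struct; the second reads only
__field_mapping plus the one sub-column that can hold "usage".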

I also added a short section relating this to PR #7877 (map
shredding). The two sit close on the same axis (dedicated-per-key vs.
reused-across-keys columns) and could **converge on a single
Struct-based framework** via a config switch, so I'd suggest we
explore aligning the two efforts rather than maintaining parallel
stacks.

Thank you again for taking the time on such a careful review — it
genuinely improved the proposal. I'd be very happy to discuss any of
these further, and I look forward to your thoughts.

Best regards,

Xinyu Liu

On Wed, Jun 3, 2026 at 5:23 PM Jingsong Li <[email protected]> wrote:
>
> Hi Xinyu,
>
> Thanks for driving this PIP.
>
> +1 for this!
>
> Left some comments:
>
>   1. Configuration option naming inconsistency
>
>   The "Public Interfaces" section says:
>
>   ▎ <column>.columnar-extend.enabled = true
>
>   But the "Proposed Changes" section uses:
>
>   ▎ <column>.map-storage-layout = 'extend'
>
>   These are two different APIs for enabling the same feature. Which
> one is it? The enum-based map-storage-layout is more extensible, but
> the doc needs to be self-consistent. I'd recommend the enum approach
> (map-storage-layout = extend) and dropping the .enabled flag.
>
>   2. __field_mapping — strings or int IDs?
>
>   The physical layout examples show [usage, load, iowait] (strings),
> but the text says:
>
>   ▎ "The field name <-> id dictionary is stored in file metadata;
> __field_mapping holds only int ids"
>
>   The examples should be corrected to show int IDs (e.g., [0, 1, 2])
> to avoid confusion.
>
> Please also add a concrete example showing exactly how a row is stored
> and how the stored (physical) data maps back to the logical data.
>
>   3. Predicate pushdown correctness is under-specified
>
>   The query path says:
>
>   ▎ "From file metadata, look up the dictionary: usage → its physical
> column set S (e.g., {col_0}). Translate the logical predicate usage >
> 30 into a physical sub-column predicate over S (col_0 > 30)."
>
>   This is only correct if col_0 always holds "usage" within the entire
> row group. But the design allows different rows to map different
> fields to the same physical column. If col_0 holds "usage" in row 1
> and "rss" in row 5, then a row-group-level min/max stat on col_0 is
> meaningless — it mixes values from different logical fields.
>
>   The doc needs to explicitly address:
>
>   - Within a row group, is a physical column guaranteed to always map
> to the same logical field? If yes, this is a hard constraint the
> writer must enforce (which limits flexibility). If no, predicate
> pushdown can only use the stats as a coarse pre-filter and must verify
> via __field_mapping per row.
>   - How does the "writer column layout optimization" ensure this
> invariant? What happens when it can't (e.g., two rows in the same
> group map different fields to col_0)?
>
>   This is the most critical correctness concern in the design.
>
>   4. Column assignment algorithm is missing
>
>   ▎ "Writer performs column layout optimization while writing:
> consecutive rows with similar field sets keep consistent physical
> column positions, minimizing read amplification."
>
>   This is a one-sentence hand-wave over what is arguably the most
> important algorithmic component. The assignment strategy directly
> impacts:
>
>   - Whether predicate pushdown can work at all (see #3)
>   - Read amplification (how many physical columns you need to read for
> one logical field)
>   - Overflow frequency
>
>   I'd expect the design to specify at least the basic algorithm. For example:
>   - Is it a greedy first-fit per row?
>   - Is there a "dominant mapping" per row group established from the
> first N rows?
>   - What happens when a row's field set conflicts with the established 
> mapping?
>
>   5. Overflow strategy and K sizing
>
>   With K=16 (default) and the doc's own example of 5~50 fields per
> row, the widest rows will push up to 34 fields into overflow. The overflow uses
> MAP<INT, T> — which has the same "no columnar access" problem the
> whole design is trying to solve.
>
>   The doc says "persistent overflow drives K up in later files" but:
>   - What's the adaptation formula? K = max(row_width) from the last
> file? A percentile (p95, p99)?
>   - Is there an upper bound on K? With 50,000 possible fields, K could
> grow unbounded.
>   - For the initial file, the user-configured K=16 may be a bad
> default for "5~50 fields per row" scenarios.
>
>   The doc should provide clearer guidance on K sizing and set
> expectations about overflow rates.
>
>   6. Relationship with existing VARIANT shredding infrastructure
>
>   The Paimon codebase already has a mature VARIANT shredding
> implementation (PaimonShreddingUtils, VariantSchema,
> VariantShreddingWriter, inference infrastructure in
> InferVariantShreddingSchema). There are clear architectural parallels:
>
>   - Both decompose a semi-structured type into typed sub-columns
>   - Both need a mapping/metadata layer to reconstruct the original value
>   - Both integrate with Parquet/ORC reader/writer pipelines
>
>   The PIP should discuss whether code can be shared or patterns
> reused. For example, the inference mechanism
> (InferVariantShreddingSchema) could inform the adaptive K algorithm.
>
>   7. Memory/space overhead for sparse rows
>
>   If K=16 but a row has only 2 fields, 14 physical columns are NULL.
> Plus each row carries a LIST<INT> for __field_mapping. For rows with
> small field counts, this struct-based layout may actually be worse
> than the default MAP storage in terms of space.
>
>   The doc should include a rough analysis of when columnar-extend
> breaks even versus default MAP, so users can make informed decisions
> about when to enable it.
>
>   8. __field_mapping as LIST<INT> — length vs. K
>
>   If __field_mapping has length K (one entry per physical column), a
> missing field in position i could be represented as a sentinel (-1 or
> similar). If it has variable length equal to the number of non-null
> fields, you need a way to know which column each entry maps to.
>
>   The doc's example shows [usage, load, --] where -- seems to mean "no
> field", suggesting fixed-length K. This should be made explicit. A
> fixed-length LIST where each position corresponds to a physical column
> is simpler but less compressible; a variable-length list is more
> compact but requires position metadata.
>
>   9. Read path — missing detail on column pruning
>
>   The query path says:
>
>   ▎ "Issue one read with physical schema {__field_mapping} + S"
>
>   But if __field_mapping is always required for every query (to
> confirm which column holds which field), it becomes a mandatory read
> cost. For queries that touch a single field, this is acceptable. For
> queries that don't touch the MAP at all, does the reader skip
> __field_mapping entirely? This should be explicit.
>
>   ---
>   Minor Issues
>
>   - The __overflow type is shown as MAP<INT, T> (int keys) in the
> struct definition but {steal: 0.3} (string keys) in the example.
> Should be consistent.
>   - The comparison table says "Predicate pushdown: All keys" for
> columnar-extend, but as discussed in #3, this is only true under
> specific column assignment constraints that aren't guaranteed.
>   - The "opt-in" option name <column>.columnar-extend.enabled uses a
> different column property prefix convention than the typical Paimon
> table options — worth aligning with existing conventions.
>
>   ---
>
> Best,
> Jingsong
>
> On Wed, Jun 3, 2026 at 4:29 PM 刘欣瑀 <[email protected]> wrote:
> >
> > Hi everyone,
> >
> > I'd like to start a discussion on a storage optimization for `MAP<STRING, 
> > T>` columns targeting time-series, IoT, observability, and similar 
> > scenarios.
> >
> > ### Problem
> >
> > In these workloads, data is **"globally heterogeneous but locally 
> > homogeneous"** — the global key union across all rows can reach tens of 
> > thousands, but each row only carries 5~50 keys that are highly repetitive 
> > within groups (e.g., the same reporter always reports `{usage, load,
> > iowait}`).
> >
> > Current options all fall short:
> >
> > - **Default MAP storage** (KV arrays): no per-key predicate pushdown, no 
> > per-key column pruning, no per-key statistics.
> >
> > - **VARIANT**: unshredded fields (>90% in these scenarios) fall into a 
> > binary blob, losing all columnar advantages.
> >
> > - **Wide table**: flattening 50,000+ fields into columns results in >99% 
> > NULL, with metadata explosion and unbounded schema churn.
> >
> > ### Proposed Solution: Columnar-Extend
> >
> > We propose an **opt-in storage optimization** for MAP columns — enabled via 
> > a table option:
> >
> > ```sql
> > CREATE TABLE metrics (
> >   ts TIMESTAMP,
> >   metric STRING,
> >   `ext-map` MAP<STRING, DOUBLE>
> > ) WITH (
> >   'ext-map.map-storage-layout' = 'extend',
> >   'ext-map.columnar-extend.num-columns' = '16'
> > );
> > ```
> >
> > The key idea: instead of storing MAP entries as KV arrays, physically 
> > rewrite them into a **Struct with `K` typed reusable columns** plus a 
> > lightweight `__field_mapping`. This gives every key full columnar treatment 
> > — predicate pushdown, column pruning, native statistics — while keeping the 
> > column count bounded at `K` (tens, not tens of thousands). Rows exceeding 
> > `K` keys spill into a small overflow map, so correctness never depends on 
> > `K` being large enough. `K` adapts across files based on the actual data 
> > width.
> >
> > The logical type stays `MAP<STRING, T>` — the optimization is transparent 
> > to users. Existing queries like `ext-map['usage'] > 30` work unchanged; the 
> > engine translates them into physical sub-column predicates internally.
> >
> > ### PIP Document
> >
> > The full proposal — including physical layout, query path, public interface 
> > changes, and rejected alternatives — is available here: 
> > https://cwiki.apache.org/confluence/display/PAIMON/PIP-43%3A+Columnar-Extend+Storage+Optimization+for+MAP+Type+in+Paimon
> >
> > ### Looking for Feedback
> >
> > I'd appreciate community feedback on:
> >
> > 1. The overall approach, e.g., handling rows whose key count exceeds K
> > with `__overflow` vs. other strategies.
> >
> > 2. The configuration design (`map-storage-layout` enum, `num-columns`).
> >
> > 3. Any concerns about compatibility.
> >
> > 4. Additional use cases — beyond time-series/IoT/observability, are there 
> > other scenarios in your workloads where MAP columns have high-cardinality, 
> > locally-repetitive keys that would benefit from this optimization?
> >
> > Looking forward to the discussion!
> >
> > Best regards,
> >
> > Xinyu
