Re: Columnar-extend PIP43 — review comments addressed

刘欣瑀 Thu, 04 Jun 2026 00:07:44 -0700

Hi, Jingsong,

Thank you for the thorough and insightful review — these comments were 
extremely helpful and meaningfully sharpened the design. I've updated the PIP 
to address every point. A summary per item:

1. **Config naming**. Agreed — dropped the .enabled flag entirely and 
standardized on the enum <column>.map-storage-layout = extend throughout. The 
only additional knob is <column>.columnar-extend.max-columns (the K_max cap).

2. **__field_mapping — int ids**. All examples now use int ids ([0, 1, 2], with 
-1 for an empty slot). I added a **worked example** showing the file-level name 
↔ id dictionary and a step-by-step reconstruction of each row back into 
MAP<STRING, DOUBLE>, including an overflow row.

3. **Predicate pushdown correctness**. This was the most important point. The 
design **intentionally imposes no hard invariant** that a physical column 
always holds the same logical field within a row group — different rows may map 
different fields to the same column. So pushdown is treated strictly as a 
**coarse pre-filter**: column statistics can only skip blocks that provably 
cannot match, and a column that happens to mix fields is merely less selective 
(we read a few extra rows). The exact answer is always produced by 
re-constructing each row via __field_mapping after reading.

What makes this effective in practice is the **natural locality of the target 
workloads**: rows of the same metric write in long contiguous runs sharing one 
field pattern, and the allocator pins that pattern to the same columns for the 
run. Since pushdown stats are fine-grained (page-level in Parquet, 
row-group/stripe-level in ORC), the vast majority of stat blocks hold a single 
logical field per physical column, so their min/max stay tight and page / 
row-group pruning keeps a high filtering ratio.

4. **Column assignment algorithm**. Added the streaming per-row allocator 
(ExtendColumnAllocator), which maintains an in-memory column → owning field 
state across rows: Hit (a reused field keeps its column), Evict (a new field 
takes a free column, else the LRU column is evicted), Retain (untouched columns 
keep their owner, so stable groups stay stable), Overflow (extras beyond K 
spill). To your specific questions: it is not a per-row first-fit, and there is 
no frozen "dominant mapping" — the state evolves continuously; when a row's 
field set conflicts the allocator simply re-pins, and correctness never depends 
on conflict-freedom (it's backed by __field_mapping).

5. **K sizing & overflow.** Specified explicitly. K_next = 
min(P99_row_width(recent files), K_max), adapting in both directions across 
files; K_max (default 256) bounds growth so a key explosion can't create 
unbounded columns. There is no fixed default like 16 — the first file simply 
starts at K_max (no prior files to adapt from), and an over-wide first file 
only affects that one file before adaptation converges. So overflow only 
catches long-tail rows in steady state.

6. **VARIANT shredding.** Added a dedicated section. The inference / 
reconstruct / plumbing layers (VariantShreddingWriter's ShreddedResult builder, 
PaimonShreddedRow + RowToColumnConverter, and ShreddingUtils / VariantUtils) 
are good candidates to refactor and reuse. The genuine difference is the 
**write-path column layout** — shredding binds one fixed column per field plus 
a blob, whereas extend reuses K columns per-row with a typed overflow. The PIP 
proposes generalizing the shared layers rather than building a parallel stack.

7. **Sparse-row space overhead.** Added a break-even analysis: columnar-extend 
pays off when rows are grouped and locally homogeneous and each row's field 
count is a meaningful fraction of K; default MAP can be preferable for very 
small, highly heterogeneous rows. The PIP gives concrete guidance so users can 
decide when to enable it.

8. **__field_mapping length**. Made explicit: it is **fixed length K**, one 
entry per physical column, with sentinel -1 for an empty column (examples 
updated accordingly). Fixed length keeps position → column deterministic with 
no extra position metadata, and it still compresses well since same-group rows 
share an identical mapping under RLE.

9. **Read-path pruning**. Made explicit: if a query does not reference the MAP 
column, the whole struct — including __field_mapping — is never projected and 
never read (struct-level column pruning). __field_mapping is a mandatory read 
**only** for queries that actually access the MAP.

I also added a short section relating this to PR #7877 (map shredding). The two 
sit close on the same axis (dedicated-per-key vs. reused-across-keys columns) 
and could **converge on a single Struct-based framework** via a config switch, 
so I'd suggest we explore aligning the two efforts rather than maintaining 
parallel stacks.

Thank you again for taking the time on such a careful review — it genuinely 
improved the proposal. I'd be very happy to discuss any of these further, and I 
look forward to your thoughts.

Best regards,

Xinyu Liu

At 2026-06-03 17:23:43, "Jingsong Li" <[email protected]> wrote:
>Hi Xinyu,
>
>Thanks for driving this PIP.
>
>+1 for this!
>
>Left some comments:
>
>  1. Configuration option naming inconsistency
>
>  The "Public Interfaces" section says:
>
>  ▎ <column>.columnar-extend.enabled = true
>
>  But the "Proposed Changes" section uses:
>
>  ▎ <column>.map-storage-layout = 'extend'
>
>  These are two different APIs for enabling the same feature. Which
>one is it? The enum-based map-storage-layout is more extensible, but
>the doc needs to be self-consistent. I'd recommend the enum approach
>(map-storage-layout = extend) and dropping the .enabled flag.
>
>  2. __field_mapping — strings or int IDs?
>
>  The physical layout examples show [usage, load, iowait] (strings),
>but the text says:
>
>  ▎ "The field name <-> id dictionary is stored in file metadata;
>__field_mapping holds only int ids"
>
>  The examples should be corrected to show int IDs (e.g., [0, 1, 2])
>to avoid confusion.
>
>You should describe a specific example, specifically how it is stored
>and how the mapping between actual data and real data is.
>
>  3. Predicate pushdown correctness is under-specified
>
>  The query path says:
>
>  ▎ "From file metadata, look up the dictionary: usage → its physical
>column set S (e.g., {col_0}). Translate the logical predicate usage >
>30 into a physical sub-column predicate over S (col_0 > 30)."
>
>  This is only correct if col_0 always holds "usage" within the entire
>row group. But the design allows different rows to map different
>fields to the same physical column. If col_0 holds "usage" in row 1
>and "rss" in row 5, then a row-group-level min/max stat on col_0 is
>meaningless — it mixes values from different logical fields.
>
>  The doc needs to explicitly address:
>
>  - Within a row group, is a physical column guaranteed to always map
>to the same logical field? If yes, this is a hard constraint the
>writer must enforce (which limits flexibility). If no, predicate
>pushdown can only use the stats as a coarse pre-filter and must verify
>via __field_mapping per row.
>  - How does the "writer column layout optimization" ensure this
>invariant? What happens when it can't (e.g., two rows in the same
>group map different fields to col_0)?
>
>  This is the most critical correctness concern in the design.
>
>  4. Column assignment algorithm is missing
>
>  ▎ "Writer performs column layout optimization while writing:
>consecutive rows with similar field sets keep consistent physical
>column positions, minimizing read amplification."
>
>  This is a one-sentence hand-wave over what is arguably the most
>important algorithmic component. The assignment strategy directly
>impacts:
>
>  - Whether predicate pushdown can work at all (see #3)
>  - Read amplification (how many physical columns you need to read for
>one logical field)
>  - Overflow frequency
>
>  I'd expect the design to specify at least the basic algorithm. For example:
>  - Is it a greedy first-fit per row?
>  - Is there a "dominant mapping" per row group established from the
>first N rows?
>  - What happens when a row's field set conflicts with the established mapping?
>
>  5. Overflow strategy and K sizing
>
>  With K=16 (default) and the doc's own example of 5~50 fields per
>row, many rows will have 34+ fields in overflow. The overflow uses
>MAP<INT, T> — which has the same "no columnar access" problem the
>whole design is trying to solve.
>
>  The doc says "persistent overflow drives K up in later files" but:
>  - What's the adaptation formula? K = max(row_width) from the last
>file? A percentile (p95, p99)?
>  - Is there an upper bound on K? With 50,000 possible fields, K could
>grow unbounded.
>  - For the initial file, the user-configured K=16 may be a bad
>default for "5~50 fields per row" scenarios.
>
>  The doc should provide clearer guidance on K sizing and set
>expectations about overflow rates.
>
>  6. Relationship with existing VARIANT shredding infrastructure
>
>  The Paimon codebase already has a mature VARIANT shredding
>implementation (PaimonShreddingUtils, VariantSchema,
>VariantShreddingWriter, inference infrastructure in
>InferVariantShreddingSchema). There are clear architectural parallels:
>
>  - Both decompose a semi-structured type into typed sub-columns
>  - Both need a mapping/metadata layer to reconstruct the original value
>  - Both integrate with Parquet/ORC reader/writer pipelines
>
>  The PIP should discuss whether code can be shared or patterns
>reused. For example, the inference mechanism
>(InferVariantShreddingSchema) could inform the adaptive K algorithm.
>
>  7. Memory/space overhead for sparse rows
>
>  If K=16 but a row has only 2 fields, 14 physical columns are NULL.
>Plus each row carries a LIST<INT> for __field_mapping. For rows with
>small field counts, this struct-based layout may actually be worse
>than the default MAP storage in terms of space.
>
>  The doc should include a rough analysis of when columnar-extend
>breaks even versus default MAP, so users can make informed decisions
>about when to enable it.
>
>  8. __field_mapping as LIST<INT> — length vs. K
>
>  If __field_mapping has length K (one entry per physical column), a
>missing field in position i could be represented as a sentinel (-1 or
>similar). If it has variable length equal to the number of non-null
>fields, you need a way to know which column each entry maps to.
>
>  The doc's example shows [usage, load, --] where -- seems to mean "no
>field", suggesting fixed-length K. This should be made explicit. A
>fixed-length LIST where each position corresponds to a physical column
>is simpler but less compressible; a variable-length list is more
>compact but requires position metadata.
>
>  9. Read path — missing detail on column pruning
>
>  The query path says:
>
>  ▎ "Issue one read with physical schema {__field_mapping} + S"
>
>  But if __field_mapping is always required for every query (to
>confirm which column holds which field), it becomes a mandatory read
>cost. For queries that touch a single field, this is acceptable. For
>queries that don't touch the MAP at all, does the reader skip
>__field_mapping entirely? This should be explicit.
>
>  ---
>  Minor Issues
>
>  - The __overflow type is shown as MAP<INT, T> (int keys) in the
>struct definition but {steal: 0.3} (string keys) in the example.
>Should be consistent.
>  - The comparison table says "Predicate pushdown: All keys" for
>columnar-extend, but as discussed in #3, this is only true under
>specific column assignment constraints that aren't guaranteed.
>  - The "opt-in" column name <column>.columnar-extend.enabled uses a
>different column property prefix convention than the typical Paimon
>table options — worth aligning with existing conventions.
>
>  ---
>
>Best,
>Jingsong
>
>On Wed, Jun 3, 2026 at 4:29 PM 刘欣瑀 <[email protected]> wrote:
>>
>> Hi everyone,
>>
>> I'd like to start a discussion on a storage optimization for `MAP<STRING, 
>> T>` columns targeting time-series, IoT, observability, and similar scenarios.
>>
>> ### Problem
>>
>> In these workloads, data is **"globally heterogeneous but locally 
>> homogeneous"** — the global key union across all rows can reach tens of 
>> thousands, but each row only carries 5~50 keys that are highly repetitive 
>> within groups (e.g., the same reportor always reports `{usage, load, 
>> iowait}`).
>>
>> Current options all fall short:
>>
>> - **Default MAP storage** (KV arrays): no per-key predicate pushdown, no 
>> per-key column pruning, no per-key statistics.
>>
>> - **VARIANT**: unshredded fields (>90% in these scenarios) fall into a 
>> binary blob, losing all columnar advantages.
>>
>> - **Wide table**: flattening 50,000+ fields into columns results in >99% 
>> NULL, with metadata explosion and unbounded schema churn.
>>
>> ### Proposed Solution: Columnar-Extend
>>
>> We propose an **opt-in storage optimization** for MAP columns — enabled via 
>> a table option:
>>
>> ```sql
>>
>> CREATE TABLE metrics (
>>
>> ts TIMESTAMP,
>>
>> metric STRING,
>>
>> ext-map MAP<STRING, DOUBLE>
>>
>> ) WITH (
>>
>> 'ext-map.map-storage-layout' = 'extend',
>>
>> 'ext-map.columnar-extend.num-columns' = '16'
>>
>> );
>>
>> ```
>>
>> The key idea: instead of storing MAP entries as KV arrays, physically 
>> rewrite them into a **Struct with `K` typed reusable columns** plus a 
>> lightweight `__field_mapping`. This gives every key full columnar treatment 
>> — predicate pushdown, column pruning, native statistics — while keeping the 
>> column count bounded at `K` (tens, not tens of thousands). Rows exceeding 
>> `K` keys spill into a small overflow map, so correctness never depends on 
>> `K` being large enough. `K` adapts across files based on the actual data 
>> width.
>>
>> The logical type stays `MAP<STRING, T>` — the optimization is transparent to 
>> users. Existing queries like `ext-map['usage'] > 30` work unchanged; the 
>> engine translates them into physical sub-column predicates internally.
>>
>> ### PIP Document
>>
>> The full proposal — including physical layout, query path, public interface 
>> changes, and rejected alternatives — is available here: 
>> https://cwiki.apache.org/confluence/display/PAIMON/PIP-43%3A+Columnar-Extend+Storage+Optimization+for+MAP+Type+in+Paimon
>>
>> ### Looking for Feedback
>>
>> I'd appreciate community feedback on:
>>
>> 1. The overall approach — e.g., column count exceeding K with `__overflow` 
>> vs. other strategies.
>>
>> 2. The configuration design (`map-storage-layout` enum, `num-columns`).
>>
>> 3. Any concerns about compatibility.
>>
>> 4. Additional use cases — beyond time-series/IoT/observability, are there 
>> other scenarios in your workloads where MAP columns have high-cardinality, 
>> locally-repetitive keys that would benefit from this optimization?
>>
>> Looking forward to the discussion!
>>
>> Best regards,
>>
>> Xinyu

Re: Columnar-extend PIP43 — review comments addressed

Reply via email to