Re:Re: [DISCUSS] PIP-43 Columnar-Extend Storage Optimization for MAP Type

刘欣瑀 Thu, 04 Jun 2026 20:05:50 -0700

Hi Aitozi,


Thank you, and thanks for the great contribution on PR #7877 — I agree the two 
efforts are very close, both bringing columnar storage to MAP subkeys.
Your suggested unification is exactly the direction I'd hoped for: a single 
`<column>.map-storage-layout` with `'extend'` and `'shredding'` as two modes. 
`'shredding'` binds each physical column to a fixed key (your PR #7877), while 
`'extend'` reuses a fixed set of `K` columns across keys via `__field_mapping`. The 
two differ mainly in the write-side column-assignment policy, and the physical 
columns can be wrapped in a struct per MAP column for clean namespace 
isolation. Users then pick the mode that fits their workload (stable hot keys 
vs. high-cardinality locally-repetitive patterns).
I'd be very happy to collaborate on converging the two — reusing the work 
already in PR #7877 for the `'shredding'` side. Thanks again for the +1!


Best,
Xinyu Liu


At 2026-06-04 23:41:53, "Aitozi" <[email protected]> wrote:
>Hi xinyu,
>    Thanks for your proposal and for the replies in PR #7877. That PR is
>indeed similar to your proposal, as both try to introduce columnar storage
>for MAP subkeys.
>I think these two approaches can be unified by extending
><column>.map-storage-layout = 'extend' / 'shredding', making them
>applicable to different scenarios.
>I support this direction. +1
>
>Best,
>Aitozi.
>
>
>
>Jingsong Li <[email protected]> wrote on Thu, Jun 4, 2026 at 16:35:
>
>>   Hi Xinyu,
>>
>>   Thanks for the detailed design. One clarification on the field
>> dictionary storage:
>>
>>   The PIP says the field name ↔ field id dictionary lives in "file
>> metadata". Does this mean the Parquet file footer (specifically the
>> key_value_metadata in FileMetaData), or a separate
>> sidecar/manifest-level structure?
>>
>>   If it's the Parquet footer, a few follow-up considerations:
>>
>>   1. The dictionary is duplicated in every file. For tables with a
>> large field union (tens of thousands of keys), does the footer size
>> become a concern?
>>   2. When reading, the dictionary must be loaded before any data
>> column can be interpreted — is this already accounted for in the read
>> path design?
>>
>>   Best,
>>   Jingsong
>>
>> On Thu, Jun 4, 2026 at 4:19 PM Jingsong Li <[email protected]> wrote:
>> >
>> > Move to this correct thread, from Xinyu:
>> >
>> > Hi, Jingsong,
>> >
>> > Thank you for the thorough and insightful review — these comments were
>> > extremely helpful and meaningfully sharpened the design. I've updated
>> > the PIP to address every point. A summary per item:
>> >
>> > 1. **Config naming**. Agreed — dropped the .enabled flag entirely and
>> > standardized on the enum <column>.map-storage-layout = extend
>> > throughout. The only additional knob is
>> > <column>.columnar-extend.max-columns (the K_max cap).
>> >
>> > 2. **__field_mapping — int ids**. All examples now use int ids ([0, 1,
>> > 2], with -1 for an empty slot). I added a **worked example** showing
>> > the file-level name ↔ id dictionary and a step-by-step reconstruction
>> > of each row back into MAP<STRING, DOUBLE>, including an overflow row.
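
The reconstruction described in item 2 can be sketched roughly as follows. This is a minimal illustration, not the PIP's implementation: the function name `reconstruct_row`, the `EMPTY_SLOT` constant, and the sample dictionary are all hypothetical, and the overflow is assumed to be keyed by int field id.

```python
from typing import Dict, List, Optional

EMPTY_SLOT = -1  # assumed sentinel for an unused physical column


def reconstruct_row(
    id_to_name: Dict[int, str],
    field_mapping: List[int],
    columns: List[Optional[float]],
    overflow: Dict[int, float],
) -> Dict[str, Optional[float]]:
    """Rebuild one logical MAP<STRING, DOUBLE> row from its physical pieces."""
    row: Dict[str, Optional[float]] = {}
    # Position i of __field_mapping names the field id held by physical column i.
    for field_id, value in zip(field_mapping, columns):
        if field_id != EMPTY_SLOT:
            row[id_to_name[field_id]] = value
    # Fields that did not fit into the K columns live in the overflow map.
    for field_id, value in overflow.items():
        row[id_to_name[field_id]] = value
    return row


# Hypothetical file-level dictionary and two rows stored with K = 3.
id_to_name = {0: "usage", 1: "load", 2: "iowait", 3: "steal"}
plain = reconstruct_row(id_to_name, [0, 1, -1], [42.0, 1.5, None], {})
spilled = reconstruct_row(id_to_name, [0, 1, 2], [42.0, 1.5, 0.2], {3: 0.3})
print(plain)
print(spilled)
```

The second row carries four fields with only three columns, so `steal` is recovered from the overflow map rather than from a physical column.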
>> >
>> > 3. **Predicate pushdown correctness**. This was the most important
>> > point. The design **intentionally imposes no hard invariant** that a
>> > physical column always holds the same logical field within a row group
>> > — different rows may map different fields to the same column. So
>> > pushdown is treated strictly as a **coarse pre-filter**: column
>> > statistics can only skip blocks that provably cannot match, and a
>> > column that happens to mix fields is merely less selective (we read a
>> > few extra rows). The exact answer is always produced by
>> > re-constructing each row via __field_mapping after reading.
>> >
>> > What makes this effective in practice is the **natural locality of the
>> > target workloads**: rows of the same metric are written in long contiguous
>> > runs sharing one field pattern, and the allocator pins that pattern to
>> > the same columns for the run. Since pushdown stats are fine-grained
>> > (page-level in Parquet, row-group/stripe-level in ORC), the vast
>> > majority of stat blocks hold a single logical field per physical
>> > column, so their min/max stay tight and page / row-group pruning keeps
>> > a high filtering ratio.
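
A small sketch of the "coarse pre-filter, exact re-check" behaviour described in item 3. Everything here is an assumption made for illustration: the `Block` class stands in for a statistics block such as a Parquet page, `scan_greater_than` is a hypothetical helper, and overflow is left out (rows in the sketch never spill), so a real reader would need to account for it separately.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# One stored row: (__field_mapping, values of the K physical columns).
Row = Tuple[List[int], List[Optional[float]]]


@dataclass
class Block:
    """A statistics block holding a run of rows (hypothetical model)."""
    rows: List[Row]

    def column_max(self, col: int) -> Optional[float]:
        # Max statistic of one physical column, ignoring nulls.
        values = [cols[col] for _, cols in self.rows if cols[col] is not None]
        return max(values) if values else None


def scan_greater_than(blocks: Sequence[Block], field_id: int,
                      candidate_cols: Sequence[int], threshold: float):
    """Return (matching values, number of blocks actually read)."""
    matches: List[float] = []
    blocks_read = 0
    for block in blocks:
        # Coarse pre-filter: skip only when no candidate column can match.
        maxima = [block.column_max(c) for c in candidate_cols]
        if all(m is None or m <= threshold for m in maxima):
            continue
        blocks_read += 1
        # Exact check: trust __field_mapping, not the column position.
        for mapping, cols in block.rows:
            for col, owner in enumerate(mapping):
                value = cols[col]
                if owner == field_id and value is not None and value > threshold:
                    matches.append(value)
    return matches, blocks_read


USAGE, RSS = 0, 1
blocks = [
    Block([([USAGE, -1], [10.0, None]), ([USAGE, -1], [20.0, None])]),  # skipped
    Block([([USAGE, -1], [35.0, None]), ([RSS, -1], [900.0, None])]),   # mixed
    Block([([RSS, -1], [500.0, None])]),                                # read, no hit
]
matches, blocks_read = scan_greater_than(blocks, USAGE, [0], 30.0)
print(matches, blocks_read)
```

The third block shows the cost of a column that holds another field: its statistics cannot rule the block out, so it is read and then discarded by the per-row check, which affects selectivity but not correctness.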
>> >
>> > 4. **Column assignment algorithm**. Added the streaming per-row
>> > allocator (ExtendColumnAllocator), which maintains an in-memory column
>> > → owning field state across rows: Hit (a reused field keeps its
>> > column), Evict (a new field takes a free column, else the LRU column
>> > is evicted), Retain (untouched columns keep their owner, so stable
>> > groups stay stable), Overflow (extras beyond K spill). To your
>> > specific questions: it is not a per-row first-fit, and there is no
>> > frozen "dominant mapping" — the state evolves continuously; when a
>> > row's field set conflicts the allocator simply re-pins, and
>> > correctness never depends on conflict-freedom (it's backed by
>> > __field_mapping).
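
The Hit / Evict / Retain / Overflow behaviour in item 4 could look roughly like the sketch below. It borrows the name `ExtendColumnAllocator` from the text, but the method `assign`, the tick-based LRU bookkeeping, and the tie-breaking rule (lowest column index first) are assumptions, not details confirmed by the PIP.

```python
from typing import Dict, List, Tuple

EMPTY = -1  # assumed sentinel: column unused (in state or in a row's mapping)


class ExtendColumnAllocator:
    """Streaming per-row allocator sketch with LRU eviction."""

    def __init__(self, k: int) -> None:
        self.owners: List[int] = [EMPTY] * k  # column -> owning field id
        self.last_used: List[int] = [0] * k   # column -> tick of last use
        self.tick = 0

    def assign(self, field_ids: List[int]) -> Tuple[List[int], List[int]]:
        """Return (__field_mapping for this row, field ids sent to overflow)."""
        self.tick += 1
        mapping = [EMPTY] * len(self.owners)
        column_of: Dict[int, int] = {
            f: c for c, f in enumerate(self.owners) if f != EMPTY
        }
        pending: List[int] = []
        for f in field_ids:
            if f in column_of:  # Hit: a reused field keeps its column
                col = column_of[f]
                mapping[col] = f
                self.last_used[col] = self.tick
            else:
                pending.append(f)
        overflow: List[int] = []
        for f in pending:
            # Only columns not already used by this row are candidates.
            candidates = [c for c in range(len(mapping)) if mapping[c] == EMPTY]
            if not candidates:  # Overflow: more fields than columns
                overflow.append(f)
                continue
            # Evict: prefer a free column, otherwise the least recently used.
            col = min(candidates,
                      key=lambda c: (self.owners[c] != EMPTY, self.last_used[c]))
            self.owners[col] = f
            self.last_used[col] = self.tick
            mapping[col] = f
        # Retain: columns untouched by this row keep their owner in self.owners.
        return mapping, overflow


alloc = ExtendColumnAllocator(k=3)
r1 = alloc.assign([0, 1, 2])     # fills the three free columns
r2 = alloc.assign([0, 1, 2])     # all hits, same layout
r3 = alloc.assign([0, 3])        # field 3 evicts an LRU column
r4 = alloc.assign([0, 1, 2, 3])  # four fields, K = 3: one overflows
print(r1, r2, r3, r4)
```

In the third row, column 2 is untouched and still owned by field 2, which is why the fourth row finds it again without re-pinning.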
>> >
>> > 5. **K sizing & overflow.** Specified explicitly. K_next =
>> > min(P99_row_width(recent files), K_max), adapting in both directions
>> > across files; K_max (default 256) bounds growth so a key explosion
>> > can't create unbounded columns. There is no fixed default like 16 —
>> > the first file simply starts at K_max (no prior files to adapt from),
>> > and an over-wide first file only affects that one file before
>> > adaptation converges. So overflow only catches long-tail rows in
>> > steady state.
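
The sizing rule K_next = min(P99_row_width(recent files), K_max) from item 5 can be sketched as below. The nearest-rank percentile definition and the helper names `p99_row_width` and `next_k` are assumptions; the PIP may compute the percentile differently.

```python
from typing import Sequence

K_MAX_DEFAULT = 256  # default cap quoted in item 5


def p99_row_width(widths: Sequence[int]) -> int:
    """Nearest-rank 99th percentile of per-row field counts (assumed definition)."""
    ordered = sorted(widths)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def next_k(recent_widths: Sequence[int], k_max: int = K_MAX_DEFAULT) -> int:
    """K for the next file: P99 of recent row widths, capped at k_max."""
    if not recent_widths:
        return k_max  # first file: nothing to adapt from, start at the cap
    return min(p99_row_width(recent_widths), k_max)


k_first = next_k([])                       # no prior files
k_tail = next_k([12] * 990 + [40] * 10)    # 1% long-tail rows are ignored
k_wide = next_k([12] * 900 + [40] * 100)   # 10% wide rows pull K up
k_capped = next_k([300] * 10)              # key explosion is bounded by K_max
print(k_first, k_tail, k_wide, k_capped)
```

With 1% of rows at width 40, K settles at 12 and only those long-tail rows spill. Once wide rows make up 10% of the data, K grows to 40.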
>> >
>> > 6. **VARIANT shredding.** Added a dedicated section. The inference /
>> > reconstruct / plumbing layers (VariantShreddingWriter's ShreddedResult
>> > builder, PaimonShreddedRow + RowToColumnConverter, and ShreddingUtils
>> > / VariantUtils) are good candidates to refactor and reuse. The genuine
>> > difference is the **write-path column layout** — shredding binds one
>> > fixed column per field plus a blob, whereas extend reuses K columns
>> > per-row with a typed overflow. The PIP proposes generalizing the
>> > shared layers rather than building a parallel stack.
>> >
>> > 7. **Sparse-row space overhead.** Added a break-even analysis:
>> > columnar-extend pays off when rows are grouped and locally homogeneous
>> > and each row's field count is a meaningful fraction of K; default MAP
>> > can be preferable for very small, highly heterogeneous rows. The PIP
>> > gives concrete guidance so users can decide when to enable it.
>> >
>> > 8. **__field_mapping length**. Made explicit: it is **fixed length
>> > K**, one entry per physical column, with sentinel -1 for an empty
>> > column (examples updated accordingly). Fixed length keeps position →
>> > column deterministic with no extra position metadata, and it still
>> > compresses well since same-group rows share an identical mapping under
>> > RLE.
>> >
>> > 9. **Read-path pruning**. Made explicit: if a query does not reference
>> > the MAP column, the whole struct — including __field_mapping — is
>> > never projected and never read (struct-level column pruning).
>> > __field_mapping is a mandatory read **only** for queries that actually
>> > access the MAP.
>> >
>> > I also added a short section relating this to PR #7877 (map
>> > shredding). The two sit close on the same axis (dedicated-per-key vs.
>> > reused-across-keys columns) and could **converge on a single
>> > Struct-based framework** via a config switch, so I'd suggest we
>> > explore aligning the two efforts rather than maintaining parallel
>> > stacks.
>> >
>> > Thank you again for taking the time on such a careful review — it
>> > genuinely improved the proposal. I'd be very happy to discuss any of
>> > these further, and I look forward to your thoughts.
>> >
>> > Best regards,
>> >
>> > Xinyu Liu
>> >
>> > On Wed, Jun 3, 2026 at 5:23 PM Jingsong Li <[email protected]> wrote:
>> > >
>> > > Hi Xinyu,
>> > >
>> > > Thanks for driving this PIP.
>> > >
>> > > +1 for this!
>> > >
>> > > Left some comments:
>> > >
>> > >   1. Configuration option naming inconsistency
>> > >
>> > >   The "Public Interfaces" section says:
>> > >
>> > >   ▎ <column>.columnar-extend.enabled = true
>> > >
>> > >   But the "Proposed Changes" section uses:
>> > >
>> > >   ▎ <column>.map-storage-layout = 'extend'
>> > >
>> > >   These are two different APIs for enabling the same feature. Which
>> > > one is it? The enum-based map-storage-layout is more extensible, but
>> > > the doc needs to be self-consistent. I'd recommend the enum approach
>> > > (map-storage-layout = extend) and dropping the .enabled flag.
>> > >
>> > >   2. __field_mapping — strings or int IDs?
>> > >
>> > >   The physical layout examples show [usage, load, iowait] (strings),
>> > > but the text says:
>> > >
>> > >   ▎ "The field name <-> id dictionary is stored in file metadata;
>> > > __field_mapping holds only int ids"
>> > >
>> > >   The examples should be corrected to show int IDs (e.g., [0, 1, 2])
>> > > to avoid confusion.
>> > >
>> > >   Please also walk through a concrete example: how the data is
>> > > physically stored, and how the stored data maps back to the logical data.
>> > >
>> > >   3. Predicate pushdown correctness is under-specified
>> > >
>> > >   The query path says:
>> > >
>> > >   ▎ "From file metadata, look up the dictionary: usage → its physical
>> > > column set S (e.g., {col_0}). Translate the logical predicate usage >
>> > > 30 into a physical sub-column predicate over S (col_0 > 30)."
>> > >
>> > >   This is only correct if col_0 always holds "usage" within the entire
>> > > row group. But the design allows different rows to map different
>> > > fields to the same physical column. If col_0 holds "usage" in row 1
>> > > and "rss" in row 5, then a row-group-level min/max stat on col_0 is
>> > > meaningless — it mixes values from different logical fields.
>> > >
>> > >   The doc needs to explicitly address:
>> > >
>> > >   - Within a row group, is a physical column guaranteed to always map
>> > > to the same logical field? If yes, this is a hard constraint the
>> > > writer must enforce (which limits flexibility). If no, predicate
>> > > pushdown can only use the stats as a coarse pre-filter and must verify
>> > > via __field_mapping per row.
>> > >   - How does the "writer column layout optimization" ensure this
>> > > invariant? What happens when it can't (e.g., two rows in the same
>> > > group map different fields to col_0)?
>> > >
>> > >   This is the most critical correctness concern in the design.
>> > >
>> > >   4. Column assignment algorithm is missing
>> > >
>> > >   ▎ "Writer performs column layout optimization while writing:
>> > > consecutive rows with similar field sets keep consistent physical
>> > > column positions, minimizing read amplification."
>> > >
>> > >   This is a one-sentence hand-wave over what is arguably the most
>> > > important algorithmic component. The assignment strategy directly
>> > > impacts:
>> > >
>> > >   - Whether predicate pushdown can work at all (see #3)
>> > >   - Read amplification (how many physical columns you need to read for
>> > > one logical field)
>> > >   - Overflow frequency
>> > >
>> > >   I'd expect the design to specify at least the basic algorithm. For
>> > > example:
>> > >   - Is it a greedy first-fit per row?
>> > >   - Is there a "dominant mapping" per row group established from the
>> > > first N rows?
>> > >   - What happens when a row's field set conflicts with the established
>> > > mapping?
>> > >
>> > >   5. Overflow strategy and K sizing
>> > >
>> > >   With K=16 (default) and the doc's own example of 5~50 fields per
>> > > row, many rows will spill up to 34 fields into overflow. The overflow uses
>> > > MAP<INT, T> — which has the same "no columnar access" problem the
>> > > whole design is trying to solve.
>> > >
>> > >   The doc says "persistent overflow drives K up in later files" but:
>> > >   - What's the adaptation formula? K = max(row_width) from the last
>> > > file? A percentile (p95, p99)?
>> > >   - Is there an upper bound on K? With 50,000 possible fields, K could
>> > > grow unbounded.
>> > >   - For the initial file, the user-configured K=16 may be a bad
>> > > default for "5~50 fields per row" scenarios.
>> > >
>> > >   The doc should provide clearer guidance on K sizing and set
>> > > expectations about overflow rates.
>> > >
>> > >   6. Relationship with existing VARIANT shredding infrastructure
>> > >
>> > >   The Paimon codebase already has a mature VARIANT shredding
>> > > implementation (PaimonShreddingUtils, VariantSchema,
>> > > VariantShreddingWriter, inference infrastructure in
>> > > InferVariantShreddingSchema). There are clear architectural parallels:
>> > >
>> > >   - Both decompose a semi-structured type into typed sub-columns
>> > >   - Both need a mapping/metadata layer to reconstruct the original
>> > > value
>> > >   - Both integrate with Parquet/ORC reader/writer pipelines
>> > >
>> > >   The PIP should discuss whether code can be shared or patterns
>> > > reused. For example, the inference mechanism
>> > > (InferVariantShreddingSchema) could inform the adaptive K algorithm.
>> > >
>> > >   7. Memory/space overhead for sparse rows
>> > >
>> > >   If K=16 but a row has only 2 fields, 14 physical columns are NULL.
>> > > Plus each row carries a LIST<INT> for __field_mapping. For rows with
>> > > small field counts, this struct-based layout may actually be worse
>> > > than the default MAP storage in terms of space.
>> > >
>> > >   The doc should include a rough analysis of when columnar-extend
>> > > breaks even versus default MAP, so users can make informed decisions
>> > > about when to enable it.
>> > >
>> > >   8. __field_mapping as LIST<INT> — length vs. K
>> > >
>> > >   If __field_mapping has length K (one entry per physical column), a
>> > > missing field in position i could be represented as a sentinel (-1 or
>> > > similar). If it has variable length equal to the number of non-null
>> > > fields, you need a way to know which column each entry maps to.
>> > >
>> > >   The doc's example shows [usage, load, --] where -- seems to mean "no
>> > > field", suggesting fixed-length K. This should be made explicit. A
>> > > fixed-length LIST where each position corresponds to a physical column
>> > > is simpler but less compressible; a variable-length list is more
>> > > compact but requires position metadata.
>> > >
>> > >   9. Read path — missing detail on column pruning
>> > >
>> > >   The query path says:
>> > >
>> > >   ▎ "Issue one read with physical schema {__field_mapping} + S"
>> > >
>> > >   But if __field_mapping is always required for every query (to
>> > > confirm which column holds which field), it becomes a mandatory read
>> > > cost. For queries that touch a single field, this is acceptable. For
>> > > queries that don't touch the MAP at all, does the reader skip
>> > > __field_mapping entirely? This should be explicit.
>> > >
>> > >   ---
>> > >   Minor Issues
>> > >
>> > >   - The __overflow type is shown as MAP<INT, T> (int keys) in the
>> > > struct definition but {steal: 0.3} (string keys) in the example.
>> > > Should be consistent.
>> > >   - The comparison table says "Predicate pushdown: All keys" for
>> > > columnar-extend, but as discussed in #3, this is only true under
>> > > specific column assignment constraints that aren't guaranteed.
>> > >   - The "opt-in" column name <column>.columnar-extend.enabled uses a
>> > > different column property prefix convention than the typical Paimon
>> > > table options — worth aligning with existing conventions.
>> > >
>> > >   ---
>> > >
>> > > Best,
>> > > Jingsong
>> > >
>> > > On Wed, Jun 3, 2026 at 4:29 PM 刘欣瑀 <[email protected]> wrote:
>> > > >
>> > > > Hi everyone,
>> > > >
>> > > > I'd like to start a discussion on a storage optimization for
>> > > > `MAP<STRING, T>` columns targeting time-series, IoT, observability,
>> > > > and similar scenarios.
>> > > >
>> > > > ### Problem
>> > > >
>> > > > In these workloads, data is **"globally heterogeneous but locally
>> > > > homogeneous"**: the global key union across all rows can reach tens
>> > > > of thousands, but each row carries only 5~50 keys that are highly
>> > > > repetitive within groups (e.g., the same reporter always reports
>> > > > `{usage, load, iowait}`).
>> > > >
>> > > > Current options all fall short:
>> > > >
>> > > > - **Default MAP storage** (KV arrays): no per-key predicate
>> > > > pushdown, no per-key column pruning, no per-key statistics.
>> > > >
>> > > > - **VARIANT**: unshredded fields (>90% in these scenarios) fall into
>> > > > a binary blob, losing all columnar advantages.
>> > > >
>> > > > - **Wide table**: flattening 50,000+ fields into columns
>> > > > results in >99% NULL, with metadata explosion and unbounded schema
>> > > > churn.
>> > > >
>> > > > ### Proposed Solution: Columnar-Extend
>> > > >
>> > > > We propose an **opt-in storage optimization** for MAP columns,
>> > > > enabled via a table option:
>> > > >
>> > > > ```sql
>> > > > CREATE TABLE metrics (
>> > > >     ts TIMESTAMP,
>> > > >     metric STRING,
>> > > >     -- A hyphenated column name must be quoted with backticks.
>> > > >     `ext-map` MAP<STRING, DOUBLE>
>> > > > ) WITH (
>> > > >     -- Opt in to the columnar-extend layout for this MAP column.
>> > > >     'ext-map.map-storage-layout' = 'extend',
>> > > >     -- Number of reusable physical columns (K).
>> > > >     'ext-map.columnar-extend.num-columns' = '16'
>> > > > );
>> > > > ```
>> > > >
>> > > > The key idea: instead of storing MAP entries as KV arrays,
>> > > > physically rewrite them into a **Struct with `K` typed reusable
>> > > > columns** plus a lightweight `__field_mapping`. This gives every key
>> > > > full columnar treatment (predicate pushdown, column pruning, native
>> > > > statistics) while keeping the column count bounded at `K` (tens, not
>> > > > tens of thousands). Rows exceeding `K` keys spill into a small
>> > > > overflow map, so correctness never depends on `K` being large
>> > > > enough. `K` adapts across files based on the actual data width.
>> > > >
>> > > > The logical type stays `MAP<STRING, T>`; the optimization is
>> > > > transparent to users. Existing queries like `ext-map['usage'] > 30`
>> > > > work unchanged; the engine translates them into physical sub-column
>> > > > predicates internally.
>> > > >
>> > > > ### PIP Document
>> > > >
>> > > > The full proposal (including physical layout, query path, public
>> > > > interface changes, and rejected alternatives) is available here:
>> > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-43%3A+Columnar-Extend+Storage+Optimization+for+MAP+Type+in+Paimon
>> > > >
>> > > > ### Looking for Feedback
>> > > >
>> > > > I'd appreciate community feedback on:
>> > > >
>> > > > 1. The overall approach, e.g., handling column counts that exceed K
>> > > > with `__overflow` vs. other strategies.
>> > > >
>> > > > 2. The configuration design (`map-storage-layout` enum,
>> > > > `num-columns`).
>> > > >
>> > > > 3. Any concerns about compatibility.
>> > > >
>> > > > 4. Additional use cases: beyond time-series/IoT/observability, are
>> > > > there other scenarios in your workloads where MAP columns have
>> > > > high-cardinality, locally-repetitive keys that would benefit from
>> > > > this optimization?
>> > > >
>> > > > Looking forward to the discussion!
>> > > >
>> > > > Best regards,
>> > > >
>> > > > Xinyu
>>
