Re:Re: [DISCUSS] PIP-43 Columnar-Extend Storage Optimization for MAP Type

刘欣瑀 Thu, 04 Jun 2026 19:52:07 -0700

Hi Jingsong,

Great question — yes, the dictionary lives in the **file footer**, and we 
deliberately chose this over a sidecar/manifest structure so that each file 
stays **self-describing** (a file can be read and fully reconstructed without 
consulting any external state). Concretely:


**ORC**: stored in the file footer metadata, which ORC already compresses with 
zstd.
**Parquet**: Parquet does not compress the footer for us, so we **compress the 
dictionary ourselves** (e.g., zstd) and store the compressed bytes in the 
footer.

On your two follow-ups:

1. **Footer size / duplication across files.** This was our first concern too, 
so we measured it on real production data. The duplication is real, but it is 
per-file and only for the keys present in that file. Field names compress 
extremely well with zstd, and the dictionary is tiny relative to the data it 
describes. As a concrete data point: our **largest production file carries a 
dictionary of roughly 200 KB against a 2 GB data file**. In general the 
dictionary is on the order of **1/1,000 to 1/10,000 of the data file size**. So 
the per-file duplication is real but negligible in practice.

2. **Load order on the read path.** The footer (and therefore the dictionary) 
is always read **before** any data column — this is the same ordering 
Parquet/ORC already require for the schema and column statistics, so it adds no 
extra round trip; the dictionary simply rides along with the footer read that 
the reader performs anyway. Decompression and in-memory expansion are cheap as 
well: even for a large field union (~1000 metrics × ~50 fields each at ≈50 B 
per name), the **fully expanded in-memory dictionary is only a few MB**, and 
the compressed on-disk form is just **tens to a few hundred KB**. Neither the 
I/O nor the expansion has a measurable impact on read latency.

So the design keeps files self-contained at a footer cost that our measurements 
show is negligible, and the read path already loads the dictionary at the 
natural point (with the footer) before interpreting any data column.

Happy to add the footer-storage details explicitly to the PIP if that would be 
useful — let me know.

Best regards,

Xinyu Liu




At 2026-06-04 16:33:21, "Jingsong Li" <[email protected]> wrote:
>  Hi Xinyu,
>
>  Thanks for the detailed design. One clarification on the field
>dictionary storage:
>
>  The PIP says the field name ↔ field id dictionary lives in "file
>metadata". Does this mean the Parquet file footer (specifically the
>key_value_metadata in FileMetaData), or a separate
>sidecar/manifest-level structure?
>
>  If it's the Parquet footer, a few follow-up considerations:
>
>  1. The dictionary is duplicated in every file. For tables with a
>large field union (tens of thousands of keys), does the footer size
>become a concern?
>  2. When reading, the dictionary must be loaded before any data
>column can be interpreted — is this already accounted for in the read
>path design?
>
>  Best,
>  Jingsong
>
>On Thu, Jun 4, 2026 at 4:19 PM Jingsong Li <[email protected]> wrote:
>>
>> Move to this correct thread, from Xinyu:
>>
>> Hi, Jingsong,
>>
>> Thank you for the thorough and insightful review — these comments were
>> extremely helpful and meaningfully sharpened the design. I've updated
>> the PIP to address every point. A summary per item:
>>
>> 1. **Config naming**. Agreed — dropped the .enabled flag entirely and
>> standardized on the enum <column>.map-storage-layout = extend
>> throughout. The only additional knob is
>> <column>.columnar-extend.max-columns (the K_max cap).
>>
>> 2. **__field_mapping — int ids**. All examples now use int ids ([0, 1,
>> 2], with -1 for an empty slot). I added a **worked example** showing
>> the file-level name ↔ id dictionary and a step-by-step reconstruction
>> of each row back into MAP<STRING, DOUBLE>, including an overflow row.
>>
>> 3. **Predicate pushdown correctness**. This was the most important
>> point. The design **intentionally imposes no hard invariant** that a
>> physical column always holds the same logical field within a row group
>> — different rows may map different fields to the same column. So
>> pushdown is treated strictly as a **coarse pre-filter**: column
>> statistics can only skip blocks that provably cannot match, and a
>> column that happens to mix fields is merely less selective (we read a
>> few extra rows). The exact answer is always produced by
>> re-constructing each row via __field_mapping after reading.
>>
>> What makes this effective in practice is the **natural locality of the
>> target workloads**: rows of the same metric write in long contiguous
>> runs sharing one field pattern, and the allocator pins that pattern to
>> the same columns for the run. Since pushdown stats are fine-grained
>> (page-level in Parquet, row-group/stripe-level in ORC), the vast
>> majority of stat blocks hold a single logical field per physical
>> column, so their min/max stay tight and page / row-group pruning keeps
>> a high filtering ratio.
>>
>> 4. **Column assignment algorithm**. Added the streaming per-row
>> allocator (ExtendColumnAllocator), which maintains an in-memory column
>> → owning field state across rows: Hit (a reused field keeps its
>> column), Evict (a new field takes a free column, else the LRU column
>> is evicted), Retain (untouched columns keep their owner, so stable
>> groups stay stable), Overflow (extras beyond K spill). To your
>> specific questions: it is not a per-row first-fit, and there is no
>> frozen "dominant mapping" — the state evolves continuously; when a
>> row's field set conflicts the allocator simply re-pins, and
>> correctness never depends on conflict-freedom (it's backed by
>> __field_mapping).
>>
>> 5. **K sizing & overflow.** Specified explicitly. K_next =
>> min(P99_row_width(recent files), K_max), adapting in both directions
>> across files; K_max (default 256) bounds growth so a key explosion
>> can't create unbounded columns. There is no fixed default like 16 —
>> the first file simply starts at K_max (no prior files to adapt from),
>> and an over-wide first file only affects that one file before
>> adaptation converges. So overflow only catches long-tail rows in
>> steady state.
>>
>> 6. **VARIANT shredding.** Added a dedicated section. The inference /
>> reconstruct / plumbing layers (VariantShreddingWriter's ShreddedResult
>> builder, PaimonShreddedRow + RowToColumnConverter, and ShreddingUtils
>> / VariantUtils) are good candidates to refactor and reuse. The genuine
>> difference is the **write-path column layout** — shredding binds one
>> fixed column per field plus a blob, whereas extend reuses K columns
>> per-row with a typed overflow. The PIP proposes generalizing the
>> shared layers rather than building a parallel stack.
>>
>> 7. **Sparse-row space overhead.** Added a break-even analysis:
>> columnar-extend pays off when rows are grouped and locally homogeneous
>> and each row's field count is a meaningful fraction of K; default MAP
>> can be preferable for very small, highly heterogeneous rows. The PIP
>> gives concrete guidance so users can decide when to enable it.
>>
>> 8. **__field_mapping length**. Made explicit: it is **fixed length
>> K**, one entry per physical column, with sentinel -1 for an empty
>> column (examples updated accordingly). Fixed length keeps position →
>> column deterministic with no extra position metadata, and it still
>> compresses well since same-group rows share an identical mapping under
>> RLE.
>>
>> 9. **Read-path pruning**. Made explicit: if a query does not reference
>> the MAP column, the whole struct — including __field_mapping — is
>> never projected and never read (struct-level column pruning).
>> __field_mapping is a mandatory read **only** for queries that actually
>> access the MAP.
>>
>> I also added a short section relating this to PR #7877 (map
>> shredding). The two sit close on the same axis (dedicated-per-key vs.
>> reused-across-keys columns) and could **converge on a single
>> Struct-based framework** via a config switch, so I'd suggest we
>> explore aligning the two efforts rather than maintaining parallel
>> stacks.
>>
>> Thank you again for taking the time on such a careful review — it
>> genuinely improved the proposal. I'd be very happy to discuss any of
>> these further, and I look forward to your thoughts.
>>
>> Best regards,
>>
>> Xinyu Liu
>>
>> On Wed, Jun 3, 2026 at 5:23 PM Jingsong Li <[email protected]> wrote:
>> >
>> > Hi Xinyu,
>> >
>> > Thanks for driving this PIP.
>> >
>> > +1 for this!
>> >
>> > Left some comments:
>> >
>> >   1. Configuration option naming inconsistency
>> >
>> >   The "Public Interfaces" section says:
>> >
>> >   ▎ <column>.columnar-extend.enabled = true
>> >
>> >   But the "Proposed Changes" section uses:
>> >
>> >   ▎ <column>.map-storage-layout = 'extend'
>> >
>> >   These are two different APIs for enabling the same feature. Which
>> > one is it? The enum-based map-storage-layout is more extensible, but
>> > the doc needs to be self-consistent. I'd recommend the enum approach
>> > (map-storage-layout = extend) and dropping the .enabled flag.
>> >
>> >   2. __field_mapping — strings or int IDs?
>> >
>> >   The physical layout examples show [usage, load, iowait] (strings),
>> > but the text says:
>> >
>> >   ▎ "The field name <-> id dictionary is stored in file metadata;
>> > __field_mapping holds only int ids"
>> >
>> >   The examples should be corrected to show int IDs (e.g., [0, 1, 2])
>> > to avoid confusion.
>> >
>> > You should describe a specific example, specifically how it is stored
>> > and how the mapping between actual data and real data is.
>> >
>> >   3. Predicate pushdown correctness is under-specified
>> >
>> >   The query path says:
>> >
>> >   ▎ "From file metadata, look up the dictionary: usage → its physical
>> > column set S (e.g., {col_0}). Translate the logical predicate usage >
>> > 30 into a physical sub-column predicate over S (col_0 > 30)."
>> >
>> >   This is only correct if col_0 always holds "usage" within the entire
>> > row group. But the design allows different rows to map different
>> > fields to the same physical column. If col_0 holds "usage" in row 1
>> > and "rss" in row 5, then a row-group-level min/max stat on col_0 is
>> > meaningless — it mixes values from different logical fields.
>> >
>> >   The doc needs to explicitly address:
>> >
>> >   - Within a row group, is a physical column guaranteed to always map
>> > to the same logical field? If yes, this is a hard constraint the
>> > writer must enforce (which limits flexibility). If no, predicate
>> > pushdown can only use the stats as a coarse pre-filter and must verify
>> > via __field_mapping per row.
>> >   - How does the "writer column layout optimization" ensure this
>> > invariant? What happens when it can't (e.g., two rows in the same
>> > group map different fields to col_0)?
>> >
>> >   This is the most critical correctness concern in the design.
>> >
>> >   4. Column assignment algorithm is missing
>> >
>> >   ▎ "Writer performs column layout optimization while writing:
>> > consecutive rows with similar field sets keep consistent physical
>> > column positions, minimizing read amplification."
>> >
>> >   This is a one-sentence hand-wave over what is arguably the most
>> > important algorithmic component. The assignment strategy directly
>> > impacts:
>> >
>> >   - Whether predicate pushdown can work at all (see #3)
>> >   - Read amplification (how many physical columns you need to read for
>> > one logical field)
>> >   - Overflow frequency
>> >
>> >   I'd expect the design to specify at least the basic algorithm. For 
>> > example:
>> >   - Is it a greedy first-fit per row?
>> >   - Is there a "dominant mapping" per row group established from the
>> > first N rows?
>> >   - What happens when a row's field set conflicts with the established 
>> > mapping?
>> >
>> >   5. Overflow strategy and K sizing
>> >
>> >   With K=16 (default) and the doc's own example of 5~50 fields per
>> > row, many rows will have 34+ fields in overflow. The overflow uses
>> > MAP<INT, T> — which has the same "no columnar access" problem the
>> > whole design is trying to solve.
>> >
>> >   The doc says "persistent overflow drives K up in later files" but:
>> >   - What's the adaptation formula? K = max(row_width) from the last
>> > file? A percentile (p95, p99)?
>> >   - Is there an upper bound on K? With 50,000 possible fields, K could
>> > grow unbounded.
>> >   - For the initial file, the user-configured K=16 may be a bad
>> > default for "5~50 fields per row" scenarios.
>> >
>> >   The doc should provide clearer guidance on K sizing and set
>> > expectations about overflow rates.
>> >
>> >   6. Relationship with existing VARIANT shredding infrastructure
>> >
>> >   The Paimon codebase already has a mature VARIANT shredding
>> > implementation (PaimonShreddingUtils, VariantSchema,
>> > VariantShreddingWriter, inference infrastructure in
>> > InferVariantShreddingSchema). There are clear architectural parallels:
>> >
>> >   - Both decompose a semi-structured type into typed sub-columns
>> >   - Both need a mapping/metadata layer to reconstruct the original value
>> >   - Both integrate with Parquet/ORC reader/writer pipelines
>> >
>> >   The PIP should discuss whether code can be shared or patterns
>> > reused. For example, the inference mechanism
>> > (InferVariantShreddingSchema) could inform the adaptive K algorithm.
>> >
>> >   7. Memory/space overhead for sparse rows
>> >
>> >   If K=16 but a row has only 2 fields, 14 physical columns are NULL.
>> > Plus each row carries a LIST<INT> for __field_mapping. For rows with
>> > small field counts, this struct-based layout may actually be worse
>> > than the default MAP storage in terms of space.
>> >
>> >   The doc should include a rough analysis of when columnar-extend
>> > breaks even versus default MAP, so users can make informed decisions
>> > about when to enable it.
>> >
>> >   8. __field_mapping as LIST<INT> — length vs. K
>> >
>> >   If __field_mapping has length K (one entry per physical column), a
>> > missing field in position i could be represented as a sentinel (-1 or
>> > similar). If it has variable length equal to the number of non-null
>> > fields, you need a way to know which column each entry maps to.
>> >
>> >   The doc's example shows [usage, load, --] where -- seems to mean "no
>> > field", suggesting fixed-length K. This should be made explicit. A
>> > fixed-length LIST where each position corresponds to a physical column
>> > is simpler but less compressible; a variable-length list is more
>> > compact but requires position metadata.
>> >
>> >   9. Read path — missing detail on column pruning
>> >
>> >   The query path says:
>> >
>> >   ▎ "Issue one read with physical schema {__field_mapping} + S"
>> >
>> >   But if __field_mapping is always required for every query (to
>> > confirm which column holds which field), it becomes a mandatory read
>> > cost. For queries that touch a single field, this is acceptable. For
>> > queries that don't touch the MAP at all, does the reader skip
>> > __field_mapping entirely? This should be explicit.
>> >
>> >   ---
>> >   Minor Issues
>> >
>> >   - The __overflow type is shown as MAP<INT, T> (int keys) in the
>> > struct definition but {steal: 0.3} (string keys) in the example.
>> > Should be consistent.
>> >   - The comparison table says "Predicate pushdown: All keys" for
>> > columnar-extend, but as discussed in #3, this is only true under
>> > specific column assignment constraints that aren't guaranteed.
>> >   - The "opt-in" column name <column>.columnar-extend.enabled uses a
>> > different column property prefix convention than the typical Paimon
>> > table options — worth aligning with existing conventions.
>> >
>> >   ---
>> >
>> > Best,
>> > Jingsong
>> >
>> > On Wed, Jun 3, 2026 at 4:29 PM 刘欣瑀 <[email protected]> wrote:
>> > >
>> > > Hi everyone,
>> > >
>> > > I'd like to start a discussion on a storage optimization for 
>> > > `MAP<STRING, T>` columns targeting time-series, IoT, observability, and 
>> > > similar scenarios.
>> > >
>> > > ### Problem
>> > >
>> > > In these workloads, data is **"globally heterogeneous but locally 
>> > > homogeneous"** — the global key union across all rows can reach tens of 
>> > > thousands, but each row only carries 5~50 keys that are highly 
>> > > repetitive within groups (e.g., the same reportor always reports 
>> > > `{usage, load, iowait}`).
>> > >
>> > > Current options all fall short:
>> > >
>> > > - **Default MAP storage** (KV arrays): no per-key predicate pushdown, no 
>> > > per-key column pruning, no per-key statistics.
>> > >
>> > > - **VARIANT**: unshredded fields (>90% in these scenarios) fall into a 
>> > > binary blob, losing all columnar advantages.
>> > >
>> > > - **Wide table**: flattening 50,000+ fields into columns results in >99% 
>> > > NULL, with metadata explosion and unbounded schema churn.
>> > >
>> > > ### Proposed Solution: Columnar-Extend
>> > >
>> > > We propose an **opt-in storage optimization** for MAP columns — enabled 
>> > > via a table option:
>> > >
>> > > ```sql
>> > >
>> > > CREATE TABLE metrics (
>> > >
>> > > ts TIMESTAMP,
>> > >
>> > > metric STRING,
>> > >
>> > > ext-map MAP<STRING, DOUBLE>
>> > >
>> > > ) WITH (
>> > >
>> > > 'ext-map.map-storage-layout' = 'extend',
>> > >
>> > > 'ext-map.columnar-extend.num-columns' = '16'
>> > >
>> > > );
>> > >
>> > > ```
>> > >
>> > > The key idea: instead of storing MAP entries as KV arrays, physically 
>> > > rewrite them into a **Struct with `K` typed reusable columns** plus a 
>> > > lightweight `__field_mapping`. This gives every key full columnar 
>> > > treatment — predicate pushdown, column pruning, native statistics — 
>> > > while keeping the column count bounded at `K` (tens, not tens of 
>> > > thousands). Rows exceeding `K` keys spill into a small overflow map, so 
>> > > correctness never depends on `K` being large enough. `K` adapts across 
>> > > files based on the actual data width.
>> > >
>> > > The logical type stays `MAP<STRING, T>` — the optimization is 
>> > > transparent to users. Existing queries like `ext-map['usage'] > 30` work 
>> > > unchanged; the engine translates them into physical sub-column 
>> > > predicates internally.
>> > >
>> > > ### PIP Document
>> > >
>> > > The full proposal — including physical layout, query path, public 
>> > > interface changes, and rejected alternatives — is available here: 
>> > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-43%3A+Columnar-Extend+Storage+Optimization+for+MAP+Type+in+Paimon
>> > >
>> > > ### Looking for Feedback
>> > >
>> > > I'd appreciate community feedback on:
>> > >
>> > > 1. The overall approach — e.g., column count exceeding K with 
>> > > `__overflow` vs. other strategies.
>> > >
>> > > 2. The configuration design (`map-storage-layout` enum, `num-columns`).
>> > >
>> > > 3. Any concerns about compatibility.
>> > >
>> > > 4. Additional use cases — beyond time-series/IoT/observability, are 
>> > > there other scenarios in your workloads where MAP columns have 
>> > > high-cardinality, locally-repetitive keys that would benefit from this 
>> > > optimization?
>> > >
>> > > Looking forward to the discussion!
>> > >
>> > > Best regards,
>> > >
>> > > Xinyu

Re:Re: [DISCUSS] PIP-43 Columnar-Extend Storage Optimization for MAP Type

Reply via email to