Thanks Xinyu for your feedback. +1 for this.
Best,
Jingsong

On Fri, Jun 5, 2026 at 11:07 AM 刘欣瑀 <[email protected]> wrote:
>
> Hi Aitozi,
>
> Thank you, and thanks for the great contribution on PR #7877. I agree the
> two efforts are very close, both bringing columnar storage to MAP subkeys.
> Your suggested unification is exactly the direction I'd hoped for: a single
> `<column>.map-storage-layout` with `'extend'` and `'shredding'` as two modes.
> `'shredding'` binds each physical column to a fixed key (your PR #7877),
> while `'extend'` reuses a fixed set of `K` columns across keys via
> `__field_mapping`. The two differ mainly in the write-side column-assignment
> policy, and the physical columns can be wrapped in a struct per MAP column
> for clean namespace isolation. Users then pick the mode that fits their
> workload (stable hot keys vs. high-cardinality, locally repetitive patterns).
> I'd be very happy to collaborate on converging the two, reusing the work
> already in PR #7877 for the `'shredding'` side. Thanks again for the +1!
>
> Best,
> Xinyu Liu
>
> At 2026-06-04 23:41:53, "Aitozi" <[email protected]> wrote:
> >Hi Xinyu,
> >
> >Thanks for your proposal and for the replies in PR #7877. This is
> >indeed similar to your proposal, as both try to introduce columnar storage
> >for MAP subkeys.
> >I think these two approaches can be unified by extending
> ><column>.map-storage-layout = 'extend' / 'shredding', making them
> >applicable to different scenarios.
> >I support this direction. +1
> >
> >Best,
> >Aitozi.
> >
> >On Thu, Jun 4, 2026 at 16:35, Jingsong Li <[email protected]> wrote:
> >
> >> Hi Xinyu,
> >>
> >> Thanks for the detailed design. One clarification on the field
> >> dictionary storage:
> >>
> >> The PIP says the field name ↔ field id dictionary lives in "file
> >> metadata". Does this mean the Parquet file footer (specifically the
> >> key_value_metadata in FileMetaData), or a separate
> >> sidecar/manifest-level structure?
> >>
> >> If it's the Parquet footer, a few follow-up considerations:
> >>
> >> 1. The dictionary is duplicated in every file. For tables with a
> >> large field union (tens of thousands of keys), does the footer size
> >> become a concern?
> >> 2. When reading, the dictionary must be loaded before any data
> >> column can be interpreted. Is this already accounted for in the read
> >> path design?
> >>
> >> Best,
> >> Jingsong
> >>
> >> On Thu, Jun 4, 2026 at 4:19 PM Jingsong Li <[email protected]> wrote:
> >> >
> >> > Moving this to the correct thread, from Xinyu:
> >> >
> >> > Hi, Jingsong,
> >> >
> >> > Thank you for the thorough and insightful review. These comments were
> >> > extremely helpful and meaningfully sharpened the design. I've updated
> >> > the PIP to address every point. A summary per item:
> >> >
> >> > 1. **Config naming**. Agreed: dropped the .enabled flag entirely and
> >> > standardized on the enum <column>.map-storage-layout = extend
> >> > throughout. The only additional knob is
> >> > <column>.columnar-extend.max-columns (the K_max cap).
> >> >
> >> > 2. **__field_mapping: int ids**. All examples now use int ids ([0, 1,
> >> > 2], with -1 for an empty slot). I added a **worked example** showing
> >> > the file-level name ↔ id dictionary and a step-by-step reconstruction
> >> > of each row back into MAP<STRING, DOUBLE>, including an overflow row.
> >> >
> >> > 3. **Predicate pushdown correctness**. This was the most important
> >> > point. The design **intentionally imposes no hard invariant** that a
> >> > physical column always holds the same logical field within a row
> >> > group; different rows may map different fields to the same column. So
> >> > pushdown is treated strictly as a **coarse pre-filter**: column
> >> > statistics can only skip blocks that provably cannot match, and a
> >> > column that happens to mix fields is merely less selective (we read a
> >> > few extra rows).
> >> > The exact answer is always produced by
> >> > reconstructing each row via __field_mapping after reading.
> >> >
> >> > What makes this effective in practice is the **natural locality of the
> >> > target workloads**: rows of the same metric are written in long
> >> > contiguous runs sharing one field pattern, and the allocator pins that
> >> > pattern to the same columns for the run. Since pushdown stats are
> >> > fine-grained (page-level in Parquet, row-group/stripe-level in ORC),
> >> > the vast majority of stat blocks hold a single logical field per
> >> > physical column, so their min/max stay tight and page / row-group
> >> > pruning keeps a high filtering ratio.
> >> >
> >> > 4. **Column assignment algorithm**. Added the streaming per-row
> >> > allocator (ExtendColumnAllocator), which maintains an in-memory column
> >> > → owning field state across rows: Hit (a reused field keeps its
> >> > column), Evict (a new field takes a free column, else the LRU column
> >> > is evicted), Retain (untouched columns keep their owner, so stable
> >> > groups stay stable), Overflow (extras beyond K spill). To your
> >> > specific questions: it is not a per-row first-fit, and there is no
> >> > frozen "dominant mapping"; the state evolves continuously. When a
> >> > row's field set conflicts, the allocator simply re-pins, and
> >> > correctness never depends on conflict-freedom (it's backed by
> >> > __field_mapping).
> >> >
> >> > 5. **K sizing & overflow.** Specified explicitly. K_next =
> >> > min(P99_row_width(recent files), K_max), adapting in both directions
> >> > across files; K_max (default 256) bounds growth so a key explosion
> >> > can't create unbounded columns. There is no fixed default like 16:
> >> > the first file simply starts at K_max (no prior files to adapt from),
> >> > and an over-wide first file only affects that one file before
> >> > adaptation converges. So overflow only catches long-tail rows in
> >> > steady state.
> >> >
> >> > 6. **VARIANT shredding.** Added a dedicated section. The inference /
> >> > reconstruct / plumbing layers (VariantShreddingWriter's ShreddedResult
> >> > builder, PaimonShreddedRow + RowToColumnConverter, and ShreddingUtils
> >> > / VariantUtils) are good candidates to refactor and reuse. The genuine
> >> > difference is the **write-path column layout**: shredding binds one
> >> > fixed column per field plus a blob, whereas extend reuses K columns
> >> > per-row with a typed overflow. The PIP proposes generalizing the
> >> > shared layers rather than building a parallel stack.
> >> >
> >> > 7. **Sparse-row space overhead.** Added a break-even analysis:
> >> > columnar-extend pays off when rows are grouped and locally homogeneous
> >> > and each row's field count is a meaningful fraction of K; default MAP
> >> > can be preferable for very small, highly heterogeneous rows. The PIP
> >> > gives concrete guidance so users can decide when to enable it.
> >> >
> >> > 8. **__field_mapping length**. Made explicit: it is **fixed length
> >> > K**, one entry per physical column, with sentinel -1 for an empty
> >> > column (examples updated accordingly). Fixed length keeps position →
> >> > column deterministic with no extra position metadata, and it still
> >> > compresses well since same-group rows share an identical mapping under
> >> > RLE.
> >> >
> >> > 9. **Read-path pruning**. Made explicit: if a query does not reference
> >> > the MAP column, the whole struct (including __field_mapping) is
> >> > never projected and never read (struct-level column pruning).
> >> > __field_mapping is a mandatory read **only** for queries that actually
> >> > access the MAP.
> >> >
> >> > I also added a short section relating this to PR #7877 (map
> >> > shredding). The two sit close on the same axis (dedicated-per-key vs.
> >> > reused-across-keys columns) and could **converge on a single
> >> > Struct-based framework** via a config switch, so I'd suggest we
> >> > explore aligning the two efforts rather than maintaining parallel
> >> > stacks.
> >> >
> >> > Thank you again for taking the time on such a careful review; it
> >> > genuinely improved the proposal. I'd be very happy to discuss any of
> >> > these further, and I look forward to your thoughts.
> >> >
> >> > Best regards,
> >> >
> >> > Xinyu Liu
> >> >
> >> > On Wed, Jun 3, 2026 at 5:23 PM Jingsong Li <[email protected]> wrote:
> >> > >
> >> > > Hi Xinyu,
> >> > >
> >> > > Thanks for driving this PIP.
> >> > >
> >> > > +1 for this!
> >> > >
> >> > > Left some comments:
> >> > >
> >> > > 1. Configuration option naming inconsistency
> >> > >
> >> > > The "Public Interfaces" section says:
> >> > >
> >> > > ▎ <column>.columnar-extend.enabled = true
> >> > >
> >> > > But the "Proposed Changes" section uses:
> >> > >
> >> > > ▎ <column>.map-storage-layout = 'extend'
> >> > >
> >> > > These are two different APIs for enabling the same feature. Which
> >> > > one is it? The enum-based map-storage-layout is more extensible, but
> >> > > the doc needs to be self-consistent. I'd recommend the enum approach
> >> > > (map-storage-layout = extend) and dropping the .enabled flag.
> >> > >
> >> > > 2. __field_mapping: strings or int IDs?
> >> > >
> >> > > The physical layout examples show [usage, load, iowait] (strings),
> >> > > but the text says:
> >> > >
> >> > > ▎ "The field name <-> id dictionary is stored in file metadata;
> >> > > __field_mapping holds only int ids"
> >> > >
> >> > > The examples should be corrected to show int IDs (e.g., [0, 1, 2])
> >> > > to avoid confusion.
> >> > >
> >> > > Please also include a concrete example showing how a row is stored
> >> > > and how the stored data maps back to the logical data.
> >> > >
> >> > > 3. Predicate pushdown correctness is under-specified
> >> > >
> >> > > The query path says:
> >> > >
> >> > > ▎ "From file metadata, look up the dictionary: usage → its physical
> >> > > column set S (e.g., {col_0}). Translate the logical predicate
> >> > > usage > 30 into a physical sub-column predicate over S (col_0 > 30)."
> >> > >
> >> > > This is only correct if col_0 always holds "usage" within the entire
> >> > > row group. But the design allows different rows to map different
> >> > > fields to the same physical column. If col_0 holds "usage" in row 1
> >> > > and "rss" in row 5, then a row-group-level min/max stat on col_0 is
> >> > > meaningless, because it mixes values from different logical fields.
> >> > >
> >> > > The doc needs to explicitly address:
> >> > >
> >> > > - Within a row group, is a physical column guaranteed to always map
> >> > > to the same logical field? If yes, this is a hard constraint the
> >> > > writer must enforce (which limits flexibility). If no, predicate
> >> > > pushdown can only use the stats as a coarse pre-filter and must verify
> >> > > via __field_mapping per row.
> >> > > - How does the "writer column layout optimization" ensure this
> >> > > invariant? What happens when it can't (e.g., two rows in the same
> >> > > group map different fields to col_0)?
> >> > >
> >> > > This is the most critical correctness concern in the design.
> >> > >
> >> > > 4. Column assignment algorithm is missing
> >> > >
> >> > > ▎ "Writer performs column layout optimization while writing:
> >> > > consecutive rows with similar field sets keep consistent physical
> >> > > column positions, minimizing read amplification."
> >> > >
> >> > > This is a one-sentence hand-wave over what is arguably the most
> >> > > important algorithmic component.
> >> > > The assignment strategy directly impacts:
> >> > >
> >> > > - Whether predicate pushdown can work at all (see #3)
> >> > > - Read amplification (how many physical columns you need to read for
> >> > > one logical field)
> >> > > - Overflow frequency
> >> > >
> >> > > I'd expect the design to specify at least the basic algorithm. For
> >> > > example:
> >> > > - Is it a greedy first-fit per row?
> >> > > - Is there a "dominant mapping" per row group established from the
> >> > > first N rows?
> >> > > - What happens when a row's field set conflicts with the established
> >> > > mapping?
> >> > >
> >> > > 5. Overflow strategy and K sizing
> >> > >
> >> > > With K=16 (default) and the doc's own example of 5~50 fields per
> >> > > row, many rows will overflow, with up to 34 fields (50 - 16) spilling.
> >> > > The overflow uses MAP<INT, T>, which has the same "no columnar
> >> > > access" problem the whole design is trying to solve.
> >> > >
> >> > > The doc says "persistent overflow drives K up in later files" but:
> >> > > - What's the adaptation formula? K = max(row_width) from the last
> >> > > file? A percentile (p95, p99)?
> >> > > - Is there an upper bound on K? With 50,000 possible fields, K could
> >> > > grow unbounded.
> >> > > - For the initial file, the user-configured K=16 may be a bad
> >> > > default for "5~50 fields per row" scenarios.
> >> > >
> >> > > The doc should provide clearer guidance on K sizing and set
> >> > > expectations about overflow rates.
> >> > >
> >> > > 6. Relationship with existing VARIANT shredding infrastructure
> >> > >
> >> > > The Paimon codebase already has a mature VARIANT shredding
> >> > > implementation (PaimonShreddingUtils, VariantSchema,
> >> > > VariantShreddingWriter, inference infrastructure in
> >> > > InferVariantShreddingSchema).
> >> > > There are clear architectural parallels:
> >> > >
> >> > > - Both decompose a semi-structured type into typed sub-columns
> >> > > - Both need a mapping/metadata layer to reconstruct the original
> >> > > value
> >> > > - Both integrate with Parquet/ORC reader/writer pipelines
> >> > >
> >> > > The PIP should discuss whether code can be shared or patterns
> >> > > reused. For example, the inference mechanism
> >> > > (InferVariantShreddingSchema) could inform the adaptive K algorithm.
> >> > >
> >> > > 7. Memory/space overhead for sparse rows
> >> > >
> >> > > If K=16 but a row has only 2 fields, 14 physical columns are NULL.
> >> > > Plus each row carries a LIST<INT> for __field_mapping. For rows with
> >> > > small field counts, this struct-based layout may actually be worse
> >> > > than the default MAP storage in terms of space.
> >> > >
> >> > > The doc should include a rough analysis of when columnar-extend
> >> > > breaks even versus default MAP, so users can make informed decisions
> >> > > about when to enable it.
> >> > >
> >> > > 8. __field_mapping as LIST<INT>: length vs. K
> >> > >
> >> > > If __field_mapping has length K (one entry per physical column), a
> >> > > missing field in position i could be represented as a sentinel (-1 or
> >> > > similar). If it has variable length equal to the number of non-null
> >> > > fields, you need a way to know which column each entry maps to.
> >> > >
> >> > > The doc's example shows [usage, load, --] where -- seems to mean "no
> >> > > field", suggesting fixed-length K. This should be made explicit. A
> >> > > fixed-length LIST where each position corresponds to a physical column
> >> > > is simpler but less compressible; a variable-length list is more
> >> > > compact but requires position metadata.
> >> > >
> >> > > 9. Read path: missing detail on column pruning
> >> > >
> >> > > The query path says:
> >> > >
> >> > > ▎ "Issue one read with physical schema {__field_mapping} + S"
> >> > >
> >> > > But if __field_mapping is always required for every query (to
> >> > > confirm which column holds which field), it becomes a mandatory read
> >> > > cost. For queries that touch a single field, this is acceptable. For
> >> > > queries that don't touch the MAP at all, does the reader skip
> >> > > __field_mapping entirely? This should be explicit.
> >> > >
> >> > > ---
> >> > > Minor Issues
> >> > >
> >> > > - The __overflow type is shown as MAP<INT, T> (int keys) in the
> >> > > struct definition but {steal: 0.3} (string keys) in the example.
> >> > > Should be consistent.
> >> > > - The comparison table says "Predicate pushdown: All keys" for
> >> > > columnar-extend, but as discussed in #3, this is only true under
> >> > > specific column assignment constraints that aren't guaranteed.
> >> > > - The "opt-in" option name <column>.columnar-extend.enabled uses a
> >> > > different column property prefix convention than the typical Paimon
> >> > > table options; worth aligning with existing conventions.
> >> > >
> >> > > ---
> >> > >
> >> > > Best,
> >> > > Jingsong
> >> > >
> >> > > On Wed, Jun 3, 2026 at 4:29 PM 刘欣瑀 <[email protected]> wrote:
> >> > > >
> >> > > > Hi everyone,
> >> > > >
> >> > > > I'd like to start a discussion on a storage optimization for
> >> > > > `MAP<STRING, T>` columns targeting time-series, IoT, observability,
> >> > > > and similar scenarios.
> >> > > >
> >> > > > ### Problem
> >> > > >
> >> > > > In these workloads, data is **"globally heterogeneous but locally
> >> > > > homogeneous"**: the global key union across all rows can reach tens
> >> > > > of thousands, but each row only carries 5~50 keys that are highly
> >> > > > repetitive within groups (e.g., the same reporter always reports
> >> > > > `{usage, load, iowait}`).
> >> > > >
> >> > > > Current options all fall short:
> >> > > >
> >> > > > - **Default MAP storage** (KV arrays): no per-key predicate
> >> > > > pushdown, no per-key column pruning, no per-key statistics.
> >> > > >
> >> > > > - **VARIANT**: unshredded fields (>90% in these scenarios) fall into
> >> > > > a binary blob, losing all columnar advantages.
> >> > > >
> >> > > > - **Wide table**: flattening 50,000+ fields into columns results in
> >> > > > >99% NULL, with metadata explosion and unbounded schema churn.
> >> > > >
> >> > > > ### Proposed Solution: Columnar-Extend
> >> > > >
> >> > > > We propose an **opt-in storage optimization** for MAP columns,
> >> > > > enabled via a table option:
> >> > > >
> >> > > > ```sql
> >> > > > CREATE TABLE metrics (
> >> > > >   ts TIMESTAMP,
> >> > > >   metric STRING,
> >> > > >   -- the hyphenated column name is quoted so the DDL parses
> >> > > >   `ext-map` MAP<STRING, DOUBLE>
> >> > > > ) WITH (
> >> > > >   'ext-map.map-storage-layout' = 'extend',
> >> > > >   'ext-map.columnar-extend.num-columns' = '16'
> >> > > > );
> >> > > > ```
> >> > > >
> >> > > > The key idea: instead of storing MAP entries as KV arrays,
> >> > > > physically rewrite them into a **Struct with `K` typed reusable
> >> > > > columns** plus a lightweight `__field_mapping`. This gives every key
> >> > > > full columnar treatment (predicate pushdown, column pruning, native
> >> > > > statistics) while keeping the column count bounded at `K` (tens, not
> >> > > > tens of thousands). Rows exceeding `K` keys spill into a small
> >> > > > overflow map, so correctness never depends on `K` being large
> >> > > > enough. `K` adapts across files based on the actual data width.
> >> > > >
> >> > > > The logical type stays `MAP<STRING, T>`, so the optimization is
> >> > > > transparent to users. Existing queries like `ext-map['usage'] > 30`
> >> > > > work unchanged; the engine translates them into physical sub-column
> >> > > > predicates internally.
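
To make the layout above concrete, here is a minimal Python sketch of how one physical row could be turned back into the logical map. It assumes the conventions stated elsewhere in this thread (int field ids in `__field_mapping`, `-1` for an empty slot, int-keyed overflow). The function name `reconstruct_row`, the dictionary contents, and the sample values are hypothetical and only for illustration; this is not the PIP's reader code.

```python
from typing import Dict, List, Optional

EMPTY_SLOT = -1  # sentinel for an unused physical column, as described in the thread


def reconstruct_row(
    id_to_name: Dict[int, str],
    field_mapping: List[int],
    columns: List[Optional[float]],
    overflow: Optional[Dict[int, float]] = None,
) -> Dict[str, Optional[float]]:
    """Rebuild the logical MAP<STRING, DOUBLE> for a single row.

    id_to_name    -- file-level field id -> field name dictionary (assumed shape)
    field_mapping -- fixed-length list, one field id per physical column
    columns       -- the K physical column values for this row
    overflow      -- optional int-keyed spill map for rows wider than K
    """
    if len(field_mapping) != len(columns):
        raise ValueError("__field_mapping needs one entry per physical column")
    logical: Dict[str, Optional[float]] = {}
    for field_id, value in zip(field_mapping, columns):
        if field_id == EMPTY_SLOT:
            continue  # this column carries no field in this row
        logical[id_to_name[field_id]] = value
    for field_id, value in (overflow or {}).items():
        logical[id_to_name[field_id]] = value
    return logical


# Hypothetical file-level dictionary and three rows with K = 3.
id_to_name = {0: "usage", 1: "load", 2: "iowait", 3: "steal"}
row_full = reconstruct_row(id_to_name, [0, 1, 2], [42.0, 1.5, 0.2])
row_sparse = reconstruct_row(id_to_name, [0, 1, -1], [55.0, 2.0, None])
row_spill = reconstruct_row(id_to_name, [0, 1, 2], [61.0, 0.9, 0.1], {3: 0.3})
```

The overflow entries go through the same id-to-name lookup as the physical columns, which is why a row with more than `K` keys still reconstructs to a complete map.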
> >> > > >
> >> > > > ### PIP Document
> >> > > >
> >> > > > The full proposal, including physical layout, query path, public
> >> > > > interface changes, and rejected alternatives, is available here:
> >> > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-43%3A+Columnar-Extend+Storage+Optimization+for+MAP+Type+in+Paimon
> >> > > >
> >> > > > ### Looking for Feedback
> >> > > >
> >> > > > I'd appreciate community feedback on:
> >> > > >
> >> > > > 1. The overall approach, e.g., handling column counts exceeding K
> >> > > > with `__overflow` vs. other strategies.
> >> > > >
> >> > > > 2. The configuration design (`map-storage-layout` enum,
> >> > > > `num-columns`).
> >> > > >
> >> > > > 3. Any concerns about compatibility.
> >> > > >
> >> > > > 4. Additional use cases: beyond time-series/IoT/observability, are
> >> > > > there other scenarios in your workloads where MAP columns have
> >> > > > high-cardinality, locally repetitive keys that would benefit from
> >> > > > this optimization?
> >> > > >
> >> > > > Looking forward to the discussion!
> >> > > >
> >> > > > Best regards,
> >> > > >
> >> > > > Xinyu
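
The streaming allocator summarized in Xinyu's reply (Hit / Evict / Retain / Overflow) could look roughly like the Python sketch below. The class name `ColumnAllocatorSketch`, its fields, and the tie-breaking rule (prefer a never-owned column, then the least recently used one) are assumptions made for illustration. It also assumes that a retained column a row does not use shows up as `-1` in that row's mapping. The actual ExtendColumnAllocator in the PIP may differ on all of these points.

```python
from typing import Dict, List, Tuple

EMPTY = -1  # sentinel: column unused in this row / column has no owner yet


class ColumnAllocatorSketch:
    """Hypothetical per-row allocator over K reusable physical columns."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.owner: List[int] = [EMPTY] * k   # column -> owning field id
        self.last_used: List[int] = [0] * k   # column -> tick of last use
        self.col_of: Dict[int, int] = {}      # field id -> column
        self.tick = 0

    def assign(self, field_ids: List[int]) -> Tuple[List[int], List[int]]:
        """Return (per-row field mapping of length K, overflow field ids)."""
        self.tick += 1
        mapping = [EMPTY] * self.k
        misses: List[int] = []
        # Hit: a field that already owns a column keeps it.
        for f in dict.fromkeys(field_ids):  # de-duplicate, keep order
            col = self.col_of.get(f)
            if col is None:
                misses.append(f)
            else:
                mapping[col] = f
                self.last_used[col] = self.tick
        overflow: List[int] = []
        for f in misses:
            # Columns not yet claimed by this row are candidates.
            free = [c for c in range(self.k) if mapping[c] == EMPTY]
            if not free:
                overflow.append(f)  # Overflow: row is wider than K
                continue
            # Prefer a never-owned column, otherwise evict the LRU owner.
            col = min(free, key=lambda c: (self.owner[c] != EMPTY, self.last_used[c]))
            if self.owner[col] != EMPTY:
                del self.col_of[self.owner[col]]  # Evict
            self.owner[col] = f
            self.col_of[f] = col
            mapping[col] = f
            self.last_used[col] = self.tick
        # Retain: columns untouched by this row keep their owner in self.owner.
        return mapping, overflow


alloc = ColumnAllocatorSketch(k=3)
r1 = alloc.assign([10, 11])              # two new fields take free columns
r2 = alloc.assign([10, 11])              # same pattern, same columns (hits)
r3 = alloc.assign([12])                  # new field takes the remaining free column
r4 = alloc.assign([10, 13])              # 13 evicts the LRU owner (field 11)
r5 = alloc.assign([10, 13, 12, 14, 15])  # two fields beyond K spill
```

Because every row carries its own mapping, an eviction in the middle of a run changes only how selective the column statistics are, not the reconstructed result.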
