Re: [DISCUSS] Introduce a new file format for wide table

Dapeng Sun Wed, 13 May 2026 20:45:26 -0700

Hi Jingsong,

+1 on the opt-in design. Defaulting to no stats makes sense for wide-table
workloads — computing min/max across thousands of columns introduces
non-trivial metadata overhead, and letting users explicitly opt in for
frequently filtered columns (e.g. hot feature columns) strikes a good
balance.
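
To make the opt-in idea concrete, here is a minimal sketch of min/max stats built only for explicitly listed columns. The names `build_stats`, `may_match_equals`, and `stats_columns` are hypothetical and are not part of Mosaic or Paimon; the sketch only illustrates the trade-off discussed above.

```python
from typing import Any, Dict, Iterable, List, Tuple

Stats = Dict[str, Tuple[Any, Any]]


def build_stats(rows: List[Dict[str, Any]], stats_columns: Iterable[str]) -> Stats:
    """Compute (min, max) only for the opted-in columns; NULLs are ignored."""
    stats: Stats = {}
    for name in stats_columns:
        values = [row[name] for row in rows if row.get(name) is not None]
        if values:
            stats[name] = (min(values), max(values))
    return stats


def may_match_equals(stats: Stats, column: str, value: Any) -> bool:
    """Return False only when the stats prove that no row can equal `value`."""
    bounds = stats.get(column)
    if bounds is None:
        # No stats were built for this column, so the data cannot be skipped.
        return True
    low, high = bounds
    return low <= value <= high


# Hypothetical row group with one "hot" feature column opted in.
rows = [
    {"f_click": 3, "f_view": 10},
    {"f_click": 7, "f_view": 2},
]
stats = build_stats(rows, stats_columns=["f_click"])
```

With this shape, columns that were not opted in cost nothing in metadata, and a filter on them simply falls back to reading the data.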


Best,
Dapeng

Jingsong Li <[email protected]> wrote on Wed, May 13, 2026 at 21:31:

> Hi Dapeng,
>
> We may be able to support filter pushdown by storing min/max statistics:
> users specify which columns should have stats built, and by default no
> stats are built, so they take up no storage.
>
> Best,
> Jingsong
>
> On Wed, May 13, 2026 at 7:41 PM Jingsong Li <[email protected]>
> wrote:
> >
> > Thanks Dapeng for your feedback.
> >
> > - Schema evolution: This should mainly be handled by the Paimon layer,
> > which evolves the schema based on the difference between the file's
> > schema ID and the schema currently being read. However, the format
> > itself should also be able to read by column name, return NULL for
> > columns missing from the file, and handle simple type changes, just
> > as Parquet does when used in Paimon.
> >
> > - Filter pushdown: The first version does not plan to include filter
> > pushdown. We may need to support statistics for specified columns in
> > the future, but that is a long way off.
> >
> > - Repository: We will first incubate it in the Paimon community until
> > the ecosystem is more mature (for example, once it is used by other
> > table formats), and then consider a separate repository.
> >
> > Best,
> > Jingsong
> >
> > On Wed, May 13, 2026 at 7:09 PM Dapeng Sun <[email protected]> wrote:
> > >
> > > Hi Jingsong,
> > >
> > > Thanks for sharing this. The design looks really promising for
> > > wide-table scenarios.
> > >
> > > The projection latency numbers stand out in particular. 2.3 ms for 1
> > > column out of 10,000 is a meaningful result, and the name-based
> > > bucketing aligns well with real-world patterns where columns tend to
> > > share common prefixes (e.g., feature stores or multi-modal metadata
> > > like `image_*`).
> > >
> > > A few questions as this evolves:
> > >
> > > - Schema evolution: How does Mosaic handle column additions or renames?
> > > Since bucket assignment is range-based on column names, a rename could
> > > shift a column across bucket boundaries — curious if there's a planned
> > > strategy for that.
> > > - Filter pushdown: Is predicate pushdown on the roadmap, or is the
> > > current focus primarily on projection? For feature serving workloads,
> > > point lookups with filters could be another interesting optimization
> > > target.
> > > - Repository: A standalone repo might make it easier for other
> > > projects to adopt it independently, without taking on Paimon as a
> > > dependency, though I'm curious how you're thinking about this.
> > >
> > > Looking forward to the RFC and seeing this develop further!
> > >
> > > Best,
> > > Dapeng
> > >
> > > Jingsong Li <[email protected]> wrote on Wed, May 13, 2026 at 18:00:
> > >
> > > > Hi everyone,
> > > >
> > > > I'd like to introduce a new file format for wide tables.
> > > >
> > > > Mosaic is a columnar-bucket hybrid format optimized for wide tables
> > > > (10,000+ columns). Columns are sorted by name and evenly distributed
> > > > into buckets using range-based assignment, stored column-oriented
> > > > within each bucket, and independently compressed. This enables
> > > > efficient projection pushdown at bucket granularity: reading 10
> > > > columns out of 10,000 only decompresses the buckets that contain
> > > > those 10 columns. Range-based assignment ensures that columns with
> > > > similar name prefixes land in the same bucket, improving both
> > > > compression ratio and projection locality.
> > > >
> > > > - Columns are grouped into buckets by name, enabling selective I/O:
> > > > read only the buckets you need.
> > > > - Each column is automatically encoded as ALL_NULL, CONST, DICT, or
> > > > PLAIN based on its data distribution.
> > > > - Optional Zstandard compression for both data buckets and the schema
> > > > block, with configurable compression level.
> > > > - Byte Pair Encoding compresses column names in the schema block,
> > > > reducing metadata overhead for wide tables.
> > > > - 18 data types from Boolean to TimestampLtz, with support for
> > > > fixed-width and variable-length encodings.
> > > >
> > > > +--------------------------------------------+
> > > > |  Row Group 0: Bucket Data                  |
> > > > |    [Bucket 0 compressed block]             |
> > > > |    [Bucket 3 compressed block]             |
> > > > |    ...  (only non-empty buckets)           |
> > > > +--------------------------------------------+
> > > > |  Row Group 1: Bucket Data                  |
> > > > |    ...                                     |
> > > > +--------------------------------------------+
> > > > |  Schema Block                              |
> > > > |    [4 bytes: uncompressed size (BE int)]   |
> > > > |    [schema data (possibly compressed)]     |
> > > > +--------------------------------------------+
> > > > |  Row Group Index (varint encoded)          |
> > > > +--------------------------------------------+
> > > > |  Footer (32 bytes, fixed)                  |
> > > > +--------------------------------------------+
> > > >
> > > > Benchmark compared to Parquet and ORC:
> > > >
> > > >   Test setup: 10,000 columns (90% STRING, 10% INT), column names ~80
> > > > bytes each, Zstd compression (level 9).
> > > >
> > > >   **File Size (10 rows):**
> > > >
> > > >   | Format  | Size       | vs Mosaic |
> > > >   |---------|------------|-----------|
> > > >   | Parquet | 9,696 KB   | 14.8x     |
> > > >   | ORC     | 6,377 KB   | 9.7x      |
> > > >   | Mosaic  | 654 KB     | 1x        |
> > > >
> > > >   **Projection Read (500 rows):**
> > > >
> > > >   | Projected Columns | Parquet    | ORC        | Mosaic    |
> > > >   |-------------------|------------|------------|-----------|
> > > >   | 10 / 10,000       | 53,170 us  | 72,729 us  | 25,081 us |
> > > >   | 1 / 10,000        | 50,919 us  | 70,712 us  | 2,374  us |
> > > >
> > > >   File size (500 rows): Parquet: 57.4 MB, ORC: 95.4 MB, Mosaic: 11.5 MB
> > > >
> > > >   **Projection Read (4,500 rows, ~458 MB Parquet):**
> > > >
> > > >   | Projected Columns | Parquet     | ORC        | Mosaic     |
> > > >   |-------------------|-------------|------------|------------|
> > > >   | 10 / 10,000       | 369,627 us  | 89,344 us  | 67,314 us  |
> > > >   | 1 / 10,000        | 360,458 us  | 81,934 us  | 26,924 us  |
> > > >
> > > >   File size (4,500 rows): Parquet: 458.4 MB, ORC: 827.9 MB, Mosaic: 100.2 MB
> > > >
> > > > When projecting a small subset of columns, Mosaic only decompresses
> > > > the buckets containing the requested columns, avoiding I/O on the
> > > > remaining data.
> > > >
> > > > The POC is at https://github.com/JingsongLi/paimon/tree/fast_format
> > > >
> > > > We may need to create a separate repo for it.
> > > >
> > > > What do you think?
> > > >
> > > > Best,
> > > > Jingsong
> > > >
>
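
The range-based bucket assignment described in the original proposal, and the rename concern raised in the first reply, can be illustrated with a small sketch. This is not taken from the POC: the even split by sorted position, the function names `assign_buckets` and `buckets_to_read`, and the bucket count are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set


def assign_buckets(column_names: Iterable[str], num_buckets: int) -> Dict[str, int]:
    """Sort columns by name and split the sorted order into even ranges."""
    ordered: List[str] = sorted(column_names)
    total = len(ordered)
    # The i-th column in sorted order lands in bucket floor(i * num_buckets / total).
    return {name: i * num_buckets // total for i, name in enumerate(ordered)}


def buckets_to_read(assignment: Dict[str, int], projected: Iterable[str]) -> Set[int]:
    """Buckets that must be decompressed to serve a projection."""
    return {assignment[name] for name in projected}


# 100 hypothetical columns sharing two prefixes, spread over 10 buckets.
columns = [f"image_{i:03d}" for i in range(50)] + [f"text_{i:03d}" for i in range(50)]
assignment = assign_buckets(columns, num_buckets=10)

# Columns with a shared prefix tend to fall into the same bucket.
same_prefix = buckets_to_read(assignment, ["image_003", "image_007"])
mixed_prefix = buckets_to_read(assignment, ["image_003", "text_003"])

# Renaming one column changes the sorted order, which moves the renamed
# column and can also shift its former neighbours across a bucket boundary.
renamed_columns = [c if c != "image_003" else "zz_renamed" for c in columns]
renamed = assign_buckets(renamed_columns, num_buckets=10)
```

In this toy layout a projection over two `image_*` columns touches one bucket, while the rename moves the renamed column from bucket 0 to bucket 9 and pulls `image_010` from bucket 1 into bucket 0, which is the kind of boundary shift the schema evolution question is about.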
