Re: [DISCUSS] Introduce a new file format for wide table

Jingsong Li Wed, 13 May 2026 06:31:53 -0700

Hi Dapeng,

We may be able to support filter pushdown, for example by storing
min/max statistics. Users would specify which columns need stats;
by default none are built, so they occupy no storage.
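
To make the opt-in stats idea concrete, here is a minimal sketch in Python. It is not taken from the POC: the names `RowGroupStats`, `build_stats`, and `may_match_equals` are hypothetical, and it assumes stats are kept per row group and checked against an equality predicate.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Set, Tuple


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group stats: column name -> (min, max)."""
    min_max: Dict[str, Tuple[Any, Any]]


def build_stats(rows: List[Dict[str, Any]], stats_columns: Set[str]) -> RowGroupStats:
    """Collect min/max only for opted-in columns; other columns store nothing."""
    min_max: Dict[str, Tuple[Any, Any]] = {}
    for col in stats_columns:
        values = [r[col] for r in rows if r.get(col) is not None]
        if values:
            min_max[col] = (min(values), max(values))
    return RowGroupStats(min_max)


def may_match_equals(stats: RowGroupStats, col: str, literal: Any) -> bool:
    """Return False only when the row group provably contains no match."""
    bounds = stats.min_max.get(col)
    if bounds is None:
        return True  # no stats for this column, so the row group must be read
    lo, hi = bounds
    return lo <= literal <= hi


rows = [{"user_id": 10, "name": "a"}, {"user_id": 20, "name": "b"}]
stats = build_stats(rows, {"user_id"})
print(may_match_equals(stats, "user_id", 15))   # inside [10, 20]: cannot skip
print(may_match_equals(stats, "user_id", 99))   # outside the range: skip
print(may_match_equals(stats, "name", "zzz"))   # no stats built: cannot skip
```

A column without stats always falls back to "must read", so leaving stats off by default costs no storage and never produces a wrong answer.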


Best,
Jingsong

On Wed, May 13, 2026 at 7:41 PM Jingsong Li <[email protected]> wrote:
>
> Thanks Dapeng for your feedback.
>
> - Schema evolution: Actually, this should be handled by the
> Paimon layer, which evolves the schema based on the difference
> between the file's schema ID and the schema currently being read.
> However, the format itself should also be able to read by column
> name, return NULL for columns missing from the file, and handle
> simple type changes, just as Parquet does when used in Paimon.
>
> - Filter pushdown: The first version does not plan to support filter
> pushdown. We may need to support statistics for user-specified
> columns in the future, but that is a long way off.
>
> - Repository: We will first incubate it in the Paimon community until
> the ecosystem is more mature (for example, once it is used by other
> table formats), and then consider a separate repository.
>
> Best,
> Jingsong
>
> On Wed, May 13, 2026 at 7:09 PM Dapeng Sun <[email protected]> wrote:
> >
> > Hi Jingsong,
> >
> > Thanks for sharing this — the design looks really promising for wide table
> > scenarios.
> >
> > The projection latency numbers stand out in particular. 2.3ms for 1 column
> > out of 10,000 is a meaningful result, and the name-based bucketing aligns
> > well with real-world patterns where columns tend to share common prefixes
> > (e.g., feature stores or multi-modal metadata like `image_*`).
> >
> > A few questions as this evolves:
> >
> > - Schema evolution: How does Mosaic handle column additions or renames?
> > Since bucket assignment is range-based on column names, a rename could
> > shift a column across bucket boundaries — curious if there's a planned
> > strategy for that.
> > - Filter pushdown: Is predicate pushdown on the roadmap, or is the current
> > focus primarily on projection? For feature serving workloads, point lookups
> > with filters could be another interesting optimization target.
> > - Repository: A standalone repo might make it easier for other projects to
> > adopt it independently, without taking on Paimon as a dependency — though
> > I'm curious how you're thinking about this.
> >
> > Looking forward to the RFC and seeing this develop further!
> >
> > Best,
> > Dapeng
> >
> > On Wed, May 13, 2026 at 6:00 PM Jingsong Li <[email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > I'd like to introduce a new file format for wide tables.
> > >
> > > Mosaic is a columnar-bucket hybrid format optimized for wide tables
> > > (10,000+ columns). Columns are sorted by name and evenly distributed
> > > into buckets using range-based assignment, stored column-oriented
> > > within each bucket, and independently compressed. This enables
> > > efficient projection pushdown at bucket granularity — reading 10
> > > columns out of 10,000 only decompresses the buckets that contain those
> > > 10 columns. Range-based assignment ensures that columns with similar
> > > name prefixes land in the same bucket, improving both compression
> > > ratio and projection locality.
> > >
> > > - Columns are grouped into buckets by name, enabling selective I/O:
> > > read only the buckets you need.
> > > - Each column is automatically encoded as ALL_NULL, CONST, DICT, or
> > > PLAIN based on its data distribution.
> > > - Optional Zstandard compression for both data buckets and the schema
> > > block, with configurable compression level.
> > > - Byte Pair Encoding compresses column names in the schema block,
> > > reducing metadata overhead for wide tables.
> > > - 18 data types from Boolean to TimestampLtz, with support for
> > > fixed-width and variable-length encodings.
> > >
> > > +--------------------------------------------+
> > > |  Row Group 0: Bucket Data                  |
> > > |    [Bucket 0 compressed block]             |
> > > |    [Bucket 3 compressed block]             |
> > > |    ...  (only non-empty buckets)           |
> > > +--------------------------------------------+
> > > |  Row Group 1: Bucket Data                  |
> > > |    ...                                     |
> > > +--------------------------------------------+
> > > |  Schema Block                              |
> > > |    [4 bytes: uncompressed size (BE int)]   |
> > > |    [schema data (possibly compressed)]     |
> > > +--------------------------------------------+
> > > |  Row Group Index (varint encoded)          |
> > > +--------------------------------------------+
> > > |  Footer (32 bytes, fixed)                  |
> > > +--------------------------------------------+
> > >
> > > Benchmark compared to Parquet and ORC:
> > >
> > >   Test setup: 10,000 columns (90% STRING, 10% INT), column names ~80
> > > bytes each, Zstd compression (level 9).
> > >
> > >   **File Size (10 rows):**
> > >
> > >   | Format  | Size       | vs Mosaic |
> > >   |---------|------------|-----------|
> > >   | Parquet | 9,696 KB   | 14.8x     |
> > >   | ORC     | 6,377 KB   | 9.7x      |
> > >   | Mosaic  | 654 KB     | 1x        |
> > >
> > >   **Projection Read (500 rows):**
> > >
> > >   | Projected Columns | Parquet    | ORC        | Mosaic    |
> > >   |-------------------|------------|------------|-----------|
> > >   | 10 / 10,000       | 53,170 us  | 72,729 us  | 25,081 us |
> > >   | 1 / 10,000        | 50,919 us  | 70,712 us  | 2,374 us  |
> > >
> > >   File size — Parquet: 57.4 MB, ORC: 95.4 MB, Mosaic: 11.5 MB
> > >
> > >   **Projection Read (4,500 rows, ~458 MB Parquet):**
> > >
> > >   | Projected Columns | Parquet     | ORC        | Mosaic     |
> > >   |-------------------|-------------|------------|------------|
> > >   | 10 / 10,000       | 369,627 us  | 89,344 us  | 67,314 us  |
> > >   | 1 / 10,000        | 360,458 us  | 81,934 us  | 26,924 us  |
> > >
> > >   File size — Parquet: 458.4 MB, ORC: 827.9 MB, Mosaic: 100.2 MB
> > >
> > > When projecting a small subset of columns, Mosaic only decompresses
> > > the buckets containing the requested columns, avoiding I/O on the
> > > remaining data.
> > >
> > > POC is in https://github.com/JingsongLi/paimon/tree/fast_format
> > >
> > > We may need to create a separate repo for it.
> > >
> > > What do you think?
> > >
> > > Best,
> > > Jingsong
> > >

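The range-based bucket assignment described in the original proposal can be illustrated with a short Python sketch. This is an assumption-laden illustration, not the POC's code: `assign_buckets` and `buckets_to_read` are hypothetical names, and the even split by sorted position is one plausible reading of "sorted by name and evenly distributed into buckets".

```python
from typing import Dict, Iterable, Set


def assign_buckets(column_names: Iterable[str], num_buckets: int) -> Dict[str, int]:
    """Sort names and split the sorted list into contiguous, near-equal ranges."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    names = sorted(column_names)
    n = len(names)
    # The column at sorted position i goes to bucket floor(i * num_buckets / n),
    # so neighbours in name order (shared prefixes) share a bucket.
    return {name: i * num_buckets // n for i, name in enumerate(names)}


def buckets_to_read(assignment: Dict[str, int], projected: Iterable[str]) -> Set[int]:
    """Only buckets holding at least one projected column need decompressing."""
    return {assignment[name] for name in projected}


columns = ["video_b", "image_a", "text_a", "audio_b",
           "image_b", "audio_a", "video_a", "text_b"]
assignment = assign_buckets(columns, num_buckets=4)
print(buckets_to_read(assignment, ["image_a", "image_b"]))  # one bucket
print(buckets_to_read(assignment, ["audio_a", "video_b"]))  # two buckets
```

Projecting columns that share a prefix touches a single bucket, while columns far apart in name order touch one bucket each. This is consistent with the benchmark, where projecting 1 column is much cheaper than projecting 10.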