Hi Jingsong,

Thanks for sharing this. The design looks really promising for wide table scenarios.
The projection latency numbers stand out in particular. ~2.4 ms for 1 column out of
10,000 is a meaningful result, and the name-based bucketing aligns well with
real-world patterns where columns tend to share common prefixes (e.g., feature
stores or multi-modal metadata like `image_*`).

A few questions as this evolves:

- Schema evolution: How does Mosaic handle column additions or renames? Since
  bucket assignment is range-based on column names, a rename could shift a
  column across bucket boundaries. Curious whether there's a planned strategy
  for that.
- Filter pushdown: Is predicate pushdown on the roadmap, or is the current
  focus primarily on projection? For feature serving workloads, point lookups
  with filters could be another interesting optimization target.
- Repository: A standalone repo might make it easier for other projects to
  adopt the format independently, without taking on Paimon as a dependency,
  though I'm curious how you're thinking about this.

Looking forward to the RFC and seeing this develop further!

Best,
Dapeng

On Wed, May 13, 2026 at 18:00, Jingsong Li <[email protected]> wrote:
> Hi everyone,
>
> I'd like to introduce a new file format for wide tables.
>
> Mosaic is a columnar-bucket hybrid format optimized for wide tables
> (10,000+ columns). Columns are sorted by name and evenly distributed
> into buckets using range-based assignment, stored column-oriented
> within each bucket, and independently compressed. This enables
> efficient projection pushdown at bucket granularity: reading 10
> columns out of 10,000 only decompresses the buckets that contain those
> 10 columns. Range-based assignment ensures that columns with similar
> name prefixes land in the same bucket, improving both compression
> ratio and projection locality.
>
> - Columns are grouped into buckets by name, enabling selective I/O:
>   read only the buckets you need.
> - Each column is automatically encoded as ALL_NULL, CONST, DICT, or
>   PLAIN based on its data distribution.
> - Optional Zstandard compression for both data buckets and the schema
>   block, with a configurable compression level.
> - Byte Pair Encoding compresses column names in the schema block,
>   reducing metadata overhead for wide tables.
> - 18 data types from Boolean to TimestampLtz, with support for
>   fixed-width and variable-length encodings.
>
> +--------------------------------------------+
> | Row Group 0: Bucket Data                   |
> | [Bucket 0 compressed block]                |
> | [Bucket 3 compressed block]                |
> | ... (only non-empty buckets)               |
> +--------------------------------------------+
> | Row Group 1: Bucket Data                   |
> | ...                                        |
> +--------------------------------------------+
> | Schema Block                               |
> | [4 bytes: uncompressed size (BE int)]      |
> | [schema data (possibly compressed)]        |
> +--------------------------------------------+
> | Row Group Index (varint encoded)           |
> +--------------------------------------------+
> | Footer (32 bytes, fixed)                   |
> +--------------------------------------------+
>
> Benchmark compared to Parquet and ORC:
>
> Test setup: 10,000 columns (90% STRING, 10% INT), column names ~80
> bytes each, Zstd compression (level 9).
>
> **File Size (10 rows):**
>
> | Format  | Size       | vs Mosaic |
> |---------|------------|-----------|
> | Parquet | 9,696 KB   | 14.8x     |
> | ORC     | 6,377 KB   | 9.7x      |
> | Mosaic  | 654 KB     | 1x        |
>
> **Projection Read (500 rows):**
>
> | Projected Columns | Parquet    | ORC        | Mosaic    |
> |-------------------|------------|------------|-----------|
> | 10 / 10,000       | 53,170 us  | 72,729 us  | 25,081 us |
> | 1 / 10,000        | 50,919 us  | 70,712 us  | 2,374 us  |
>
> File sizes: Parquet 57.4 MB, ORC 95.4 MB, Mosaic 11.5 MB
>
> **Projection Read (4,500 rows, ~458 MB Parquet):**
>
> | Projected Columns | Parquet     | ORC        | Mosaic     |
> |-------------------|-------------|------------|------------|
> | 10 / 10,000       | 369,627 us  | 89,344 us  | 67,314 us  |
> | 1 / 10,000        | 360,458 us  | 81,934 us  | 26,924 us  |
>
> File sizes: Parquet 458.4 MB, ORC 827.9 MB, Mosaic 100.2 MB
>
> When projecting a small subset of columns, Mosaic only decompresses
> the buckets containing the requested columns, avoiding I/O on the
> remaining data.
>
> The POC is at https://github.com/JingsongLi/paimon/tree/fast_format
>
> We may need to create a separate repo for it.
>
> What do you think?
>
> Best,
> Jingsong
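P.S. To check my own understanding of the range-based assignment (and why I asked
about renames above), here is a rough sketch: sort the column names, then split the
sorted list into contiguous, evenly sized ranges, one per bucket. This is purely my
guess at the scheme, not code from the POC; the class and method names here
(`BucketSketch`, `assignBuckets`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of range-based bucket assignment, not the actual
// Mosaic implementation: columns are sorted by name and the sorted list
// is split into numBuckets contiguous ranges, so columns sharing a name
// prefix tend to land in the same bucket.
public class BucketSketch {

    static Map<String, Integer> assignBuckets(List<String> columns, int numBuckets) {
        List<String> sorted = new ArrayList<>(columns);
        Collections.sort(sorted);
        Map<String, Integer> assignment = new HashMap<>();
        int n = sorted.size();
        for (int i = 0; i < n; i++) {
            // Even range split over the sorted positions.
            assignment.put(sorted.get(i), (int) ((long) i * numBuckets / n));
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList(
            "image_height", "image_width", "image_url",
            "user_age", "user_id", "user_name");
        // With 2 buckets, sorting groups the image_* columns before the
        // user_* columns, so each prefix family shares a bucket.
        System.out.println(assignBuckets(cols, 2));
    }
}
```

If the assignment works roughly like this, then a projection of `image_*` columns
touches only one bucket, and a rename that changes a column's sort position could
move it into a different bucket, which is what motivates the schema evolution
question.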
