Hi Jingsong,

Thanks for sharing this. The design looks really promising for wide table scenarios.
The projection latency numbers stand out in particular. ~2.4 ms for 1 column out of
10,000 is a meaningful result, and the name-based bucketing aligns well with
real-world patterns where columns tend to share common prefixes (e.g., feature
stores or multi-modal metadata like `image_*`).

A few questions as this evolves:

- Schema evolution: How does Mosaic handle column additions or renames? Since
  bucket assignment is range-based on column names, a rename could shift a
  column across bucket boundaries. Curious whether there's a planned strategy
  for that.
- Filter pushdown: Is predicate pushdown on the roadmap, or is the current
  focus primarily on projection? For feature serving workloads, point lookups
  with filters could be another interesting optimization target.
- Repository: A standalone repo might make it easier for other projects to
  adopt the format independently, without taking on Paimon as a dependency,
  though I'm curious how you're thinking about this.

Looking forward to the RFC and seeing this develop further!

Best,
Dapeng

On Wed, May 13, 2026 at 18:00, Jingsong Li <[email protected]> wrote:
> Hi everyone,
>
> I'd like to introduce a new file format for wide tables.
>
> Mosaic is a columnar-bucket hybrid format optimized for wide tables
> (10,000+ columns). Columns are sorted by name and evenly distributed
> into buckets using range-based assignment, stored column-oriented
> within each bucket, and independently compressed. This enables
> efficient projection pushdown at bucket granularity: reading 10
> columns out of 10,000 only decompresses the buckets that contain those
> 10 columns. Range-based assignment ensures that columns with similar
> name prefixes land in the same bucket, improving both compression
> ratio and projection locality.
>
> - Columns are grouped into buckets by name, enabling selective I/O:
>   read only the buckets you need.
> - Each column is automatically encoded as ALL_NULL, CONST, DICT, or
>   PLAIN based on its data distribution.
> - Optional Zstandard compression for both data buckets and the schema
>   block, with a configurable compression level.
> - Byte Pair Encoding compresses column names in the schema block,
>   reducing metadata overhead for wide tables.
> - 18 data types from Boolean to TimestampLtz, with support for
>   fixed-width and variable-length encodings.
>
> +--------------------------------------------+
> | Row Group 0: Bucket Data                   |
> | [Bucket 0 compressed block]                |
> | [Bucket 3 compressed block]                |
> | ... (only non-empty buckets)               |
> +--------------------------------------------+
> | Row Group 1: Bucket Data                   |
> | ...                                        |
> +--------------------------------------------+
> | Schema Block                               |
> | [4 bytes: uncompressed size (BE int)]      |
> | [schema data (possibly compressed)]        |
> +--------------------------------------------+
> | Row Group Index (varint encoded)           |
> +--------------------------------------------+
> | Footer (32 bytes, fixed)                   |
> +--------------------------------------------+
>
> Benchmark compared to Parquet and ORC:
>
> Test setup: 10,000 columns (90% STRING, 10% INT), column names ~80
> bytes each, Zstd compression (level 9).
>
> **File Size (10 rows):**
>
> | Format  | Size       | vs Mosaic |
> |---------|------------|-----------|
> | Parquet | 9,696 KB   | 14.8x     |
> | ORC     | 6,377 KB   | 9.7x      |
> | Mosaic  | 654 KB     | 1x        |
>
> **Projection Read (500 rows):**
>
> | Projected Columns | Parquet    | ORC        | Mosaic    |
> |-------------------|------------|------------|-----------|
> | 10 / 10,000       | 53,170 us  | 72,729 us  | 25,081 us |
> | 1 / 10,000        | 50,919 us  | 70,712 us  | 2,374 us  |
>
> File sizes: Parquet 57.4 MB, ORC 95.4 MB, Mosaic 11.5 MB
>
> **Projection Read (4,500 rows, ~458 MB Parquet):**
>
> | Projected Columns | Parquet     | ORC        | Mosaic     |
> |-------------------|-------------|------------|------------|
> | 10 / 10,000       | 369,627 us  | 89,344 us  | 67,314 us  |
> | 1 / 10,000        | 360,458 us  | 81,934 us  | 26,924 us  |
>
> File sizes: Parquet 458.4 MB, ORC 827.9 MB, Mosaic 100.2 MB
>
> When projecting a small subset of columns, Mosaic only decompresses
> the buckets containing the requested columns, avoiding I/O on the
> remaining data.
>
> The POC is at https://github.com/JingsongLi/paimon/tree/fast_format
>
> We may need to create a separate repo for it.
>
> What do you think?
>
> Best,
> Jingsong
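P.S. To check my own understanding of the range-based assignment (and why I asked
about renames above), here is a rough sketch: sort the column names, then split the
sorted list into contiguous, evenly sized ranges, one per bucket. This is purely my
guess at the scheme, not code from the POC; the class and method names here
(`BucketSketch`, `assignBuckets`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of range-based bucket assignment, not the actual
// Mosaic implementation: columns are sorted by name and the sorted list
// is split into numBuckets contiguous ranges, so columns sharing a name
// prefix tend to land in the same bucket.
public class BucketSketch {

    static Map<String, Integer> assignBuckets(List<String> columns, int numBuckets) {
        List<String> sorted = new ArrayList<>(columns);
        Collections.sort(sorted);
        Map<String, Integer> assignment = new HashMap<>();
        int n = sorted.size();
        for (int i = 0; i < n; i++) {
            // Even range split over the sorted positions.
            assignment.put(sorted.get(i), (int) ((long) i * numBuckets / n));
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList(
            "image_height", "image_width", "image_url",
            "user_age", "user_id", "user_name");
        // With 2 buckets, sorting groups the image_* columns before the
        // user_* columns, so each prefix family shares a bucket.
        System.out.println(assignBuckets(cols, 2));
    }
}
```

If the assignment works roughly like this, then a projection of `image_*` columns
touches only one bucket, and a rename that changes a column's sort position could
move it into a different bucket, which is what motivates the schema evolution
question.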
