Hi everyone,

I'd like to introduce a new file format for wide tables.

Mosaic is a columnar-bucket hybrid format optimized for wide tables
(10,000+ columns). Columns are sorted by name and evenly distributed
into buckets using range-based assignment, stored column-oriented
within each bucket, and independently compressed. This enables
efficient projection pushdown at bucket granularity: reading 10
columns out of 10,000 decompresses only the buckets containing
them. Range-based assignment ensures that columns with similar
name prefixes land in the same bucket, improving both compression
ratio and projection locality.
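
To make the range-based assignment concrete, here is a minimal sketch
(not the POC code). The function names, the bucket count of 100, and
the exact splitting rule are assumptions for illustration only.

```python
def assign_buckets(column_names, num_buckets):
    """Sort names, then split them into contiguous, near-equal ranges.

    Hypothetical helper: the real format may pick boundaries differently.
    """
    ordered = sorted(column_names)
    n = len(ordered)
    # Column at sorted position idx goes to bucket idx * num_buckets // n,
    # so neighbouring names (shared prefixes) share a bucket.
    return {name: idx * num_buckets // n for idx, name in enumerate(ordered)}


def buckets_to_read(assignment, projected_columns):
    """Return the distinct bucket ids a projection has to decompress."""
    return sorted({assignment[name] for name in projected_columns})


# 10,000 columns in two prefix families, spread over 100 buckets.
columns = [f"order_{i:04d}" for i in range(5000)] + \
          [f"user_{i:04d}" for i in range(5000)]
assignment = assign_buckets(columns, num_buckets=100)

# Ten columns with a common prefix fall into a single bucket here.
projection = [f"user_{i:04d}" for i in range(10)]
print(buckets_to_read(assignment, projection))
```

In this toy setup the ten projected columns are adjacent in sort
order, so only one of the 100 buckets is touched. A projection that
spans unrelated prefixes would touch more buckets.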

Key features:

- Columns are grouped into buckets by name, enabling selective I/O:
read only the buckets you need.
- Each column is automatically encoded as ALL_NULL, CONST, DICT, or
PLAIN based on its data distribution.
- Optional Zstandard compression for both data buckets and the schema
block, with configurable compression level.
- Byte Pair Encoding compresses column names in the schema block,
reducing metadata overhead for wide tables.
- 18 data types from Boolean to TimestampLtz, with support for
fixed-width and variable-length encodings.
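
As an illustration of the per-column encoding choice listed above,
here is a small sketch. The decision rules and the `max_dict_ratio`
threshold are assumptions, not the rules used in the POC; in
particular, treating CONST as "one distinct value and no nulls" is a
guess.

```python
def choose_encoding(values, max_dict_ratio=0.5):
    """Pick ALL_NULL, CONST, DICT, or PLAIN from a column's values.

    Hypothetical heuristic: thresholds are illustrative only.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "ALL_NULL"          # nothing but nulls to store
    distinct = set(non_null)
    if len(distinct) == 1 and len(non_null) == len(values):
        return "CONST"             # one value repeated, no nulls
    if len(distinct) <= max_dict_ratio * len(non_null):
        return "DICT"              # few distinct values: dictionary pays off
    return "PLAIN"                 # mostly unique values: store as-is


print(choose_encoding([None, None, None]))
print(choose_encoding([7, 7, 7]))
print(choose_encoding(["a", "b", "a", "b", "a", "b"]))
print(choose_encoding([1, 2, 3, 4]))
```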

+--------------------------------------------+
|  Row Group 0: Bucket Data                  |
|    [Bucket 0 compressed block]             |
|    [Bucket 3 compressed block]             |
|    ...  (only non-empty buckets)           |
+--------------------------------------------+
|  Row Group 1: Bucket Data                  |
|    ...                                     |
+--------------------------------------------+
|  Schema Block                              |
|    [4 bytes: uncompressed size (BE int)]   |
|    [schema data (possibly compressed)]     |
+--------------------------------------------+
|  Row Group Index (varint encoded)          |
+--------------------------------------------+
|  Footer (32 bytes, fixed)                  |
+--------------------------------------------+
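
For the schema block framing shown in the layout, a reader could look
roughly like the sketch below. Assumptions: the size prefix is a
signed 32-bit big-endian integer, the caller already knows whether
the payload is compressed, and `zlib` stands in for Zstandard only so
that the example runs with the standard library. `read_schema_block`
is a hypothetical name.

```python
import struct
import zlib


def read_schema_block(block, decompress=None):
    """Split the 4-byte big-endian size prefix from the schema payload.

    `decompress` is a caller-supplied callable (None means the payload
    is stored uncompressed). The size check is a sanity test only.
    """
    if len(block) < 4:
        raise ValueError("schema block shorter than its 4-byte size prefix")
    (uncompressed_size,) = struct.unpack(">i", block[:4])
    payload = block[4:]
    data = decompress(payload) if decompress is not None else payload
    if len(data) != uncompressed_size:
        raise ValueError("schema size does not match the size prefix")
    return data


# Build a toy block; zlib is a stand-in for Zstandard here.
schema = b"col_a:INT;col_b:STRING"
block = struct.pack(">i", len(schema)) + zlib.compress(schema)
print(read_schema_block(block, zlib.decompress))
```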

Benchmark results compared with Parquet and ORC:

  Test setup: 10,000 columns (90% STRING, 10% INT), column names ~80
bytes each, Zstd compression (level 9).

  **File Size (10 rows):**

  | Format  | Size       | vs Mosaic |
  |---------|------------|-----------|
  | Parquet | 9,696 KB   | 14.8x     |
  | ORC     | 6,377 KB   | 9.7x      |
  | Mosaic  | 654 KB     | 1x        |

  **Projection Read (500 rows):**

  | Projected Columns | Parquet    | ORC        | Mosaic    |
  |-------------------|------------|------------|-----------|
  | 10 / 10,000       | 53,170 us  | 72,729 us  | 25,081 us |
  | 1 / 10,000        | 50,919 us  | 70,712 us  | 2,374 us  |

  File size — Parquet: 57.4 MB, ORC: 95.4 MB, Mosaic: 11.5 MB

  **Projection Read (4,500 rows, ~458 MB Parquet):**

  | Projected Columns | Parquet     | ORC        | Mosaic     |
  |-------------------|-------------|------------|------------|
  | 10 / 10,000       | 369,627 us  | 89,344 us  | 67,314 us  |
  | 1 / 10,000        | 360,458 us  | 81,934 us  | 26,924 us  |

  File size — Parquet: 458.4 MB, ORC: 827.9 MB, Mosaic: 100.2 MB

When projecting a small subset of columns, Mosaic only decompresses
the buckets containing the requested columns, avoiding I/O on the
remaining data.
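
To read the projection tables as speedups, the ratios can be computed
directly from the reported latencies. The numbers below are copied
from the tables above; the `speedups` helper is only an illustration.

```python
# Latencies in microseconds, keyed by (rows, projected columns).
RESULTS = {
    (500, 10): {"Parquet": 53_170, "ORC": 72_729, "Mosaic": 25_081},
    (500, 1): {"Parquet": 50_919, "ORC": 70_712, "Mosaic": 2_374},
    (4_500, 10): {"Parquet": 369_627, "ORC": 89_344, "Mosaic": 67_314},
    (4_500, 1): {"Parquet": 360_458, "ORC": 81_934, "Mosaic": 26_924},
}


def speedups(results):
    """Return other-format latency divided by Mosaic latency per case."""
    return {
        case: {fmt: times[fmt] / times["Mosaic"] for fmt in ("Parquet", "ORC")}
        for case, times in results.items()
    }


for (rows, cols), ratio in speedups(RESULTS).items():
    print(f"{rows} rows, {cols} col(s): "
          f"{ratio['Parquet']:.1f}x vs Parquet, {ratio['ORC']:.1f}x vs ORC")
```

On these numbers the gap is largest for the single-column read at 500
rows (roughly 21x against Parquet) and smallest for the 10-column
read at 4,500 rows against ORC (roughly 1.3x).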

The POC is at https://github.com/JingsongLi/paimon/tree/fast_format

We may need to create a separate repo for it.

What do you think?

Best,
Jingsong
