Hi everyone,

I'd like to introduce a new file format for wide tables.
Mosaic is a columnar-bucket hybrid format optimized for wide tables (10,000+ columns). Columns are sorted by name and evenly distributed into buckets using range-based assignment, stored column-oriented within each bucket, and independently compressed. This enables efficient projection pushdown at bucket granularity: reading 10 columns out of 10,000 only decompresses the buckets that contain those 10 columns. Range-based assignment ensures that columns with similar name prefixes land in the same bucket, improving both compression ratio and projection locality.

- Columns are grouped into buckets by name, enabling selective I/O: read only the buckets you need.
- Each column is automatically encoded as ALL_NULL, CONST, DICT, or PLAIN based on its data distribution.
- Optional Zstandard compression for both data buckets and the schema block, with configurable compression level.
- Byte Pair Encoding compresses column names in the schema block, reducing metadata overhead for wide tables.
- 18 data types from Boolean to TimestampLtz, with support for fixed-width and variable-length encodings.

+--------------------------------------------+
|  Row Group 0: Bucket Data                  |
|    [Bucket 0 compressed block]             |
|    [Bucket 3 compressed block]             |
|    ... (only non-empty buckets)            |
+--------------------------------------------+
|  Row Group 1: Bucket Data                  |
|    ...                                     |
+--------------------------------------------+
|  Schema Block                              |
|    [4 bytes: uncompressed size (BE int)]   |
|    [schema data (possibly compressed)]     |
+--------------------------------------------+
|  Row Group Index (varint encoded)          |
+--------------------------------------------+
|  Footer (32 bytes, fixed)                  |
+--------------------------------------------+

Benchmark compared to Parquet and ORC:

Test setup: 10,000 columns (90% STRING, 10% INT), column names ~80 bytes each, Zstd compression (level 9).
**File Size (10 rows):**

| Format  | Size       | vs Mosaic |
|---------|------------|-----------|
| Parquet | 9,696 KB   | 14.8x     |
| ORC     | 6,377 KB   | 9.7x      |
| Mosaic  | 654 KB     | 1x        |

**Projection Read (500 rows):**

| Projected Columns | Parquet    | ORC        | Mosaic    |
|-------------------|------------|------------|-----------|
| 10 / 10,000       | 53,170 us  | 72,729 us  | 25,081 us |
| 1 / 10,000        | 50,919 us  | 70,712 us  | 2,374 us  |

File size: Parquet 57.4 MB, ORC 95.4 MB, Mosaic 11.5 MB

**Projection Read (4,500 rows, ~458 MB Parquet):**

| Projected Columns | Parquet     | ORC        | Mosaic     |
|-------------------|-------------|------------|------------|
| 10 / 10,000       | 369,627 us  | 89,344 us  | 67,314 us  |
| 1 / 10,000        | 360,458 us  | 81,934 us  | 26,924 us  |

File size: Parquet 458.4 MB, ORC 827.9 MB, Mosaic 100.2 MB

When projecting a small subset of columns, Mosaic only decompresses the buckets containing the requested columns, avoiding I/O on the remaining data.

The POC is in https://github.com/JingsongLi/paimon/tree/fast_format

We may need to create a separate repo for it. What do you think?

Best,
Jingsong
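
P.S. To make the range-based assignment described above more concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the POC: the function names, the bucket count of 100, and the rank-based formula are all assumptions about one simple way to spread name-sorted columns evenly across buckets.

```python
def assign_buckets(column_names, num_buckets):
    """Map each column name to a bucket id by its rank in sorted order.

    Hypothetical helper: contiguous ranges of the sorted names share a
    bucket, so names with a common prefix tend to land together.
    """
    ordered = sorted(column_names)
    n = len(ordered)
    return {name: (rank * num_buckets) // n for rank, name in enumerate(ordered)}


def buckets_to_read(assignment, projected_columns):
    """Return the sorted set of bucket ids a projection has to touch."""
    return sorted({assignment[name] for name in projected_columns})


# Example: 10,000 columns in two prefix families, 100 buckets (assumed count).
columns = [f"user_{i:04d}" for i in range(5000)] + [f"order_{i:04d}" for i in range(5000)]
assignment = assign_buckets(columns, num_buckets=100)

# Three columns with a shared prefix fall into a single bucket here.
print(buckets_to_read(assignment, ["user_0000", "user_0001", "user_0099"]))  # [50]
```

In this toy setup each bucket holds 100 columns, so a projection of neighbouring names needs one bucket, while columns scattered across the name space would need up to one bucket per column.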
