QuakeWang opened a new issue, #378: URL: https://github.com/apache/paimon-rust/issues/378
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/paimon-rust/issues) and found nothing similar.

### Motivation

Apache Paimon already supports the Mosaic file format as a wide-table optimized data format. Mosaic is a columnar-bucket hybrid format that groups columns into deterministic buckets and compresses each bucket independently, reducing read amplification for very wide tables when queries project only a small subset of columns.

Currently, `paimon-rust` dispatches data file readers by file extension and supports formats such as Parquet, ORC, Avro, Blob, and feature-gated Vortex. However, `.mosaic` files are not recognized. As a result, Rust clients cannot read Paimon tables whose `file.format` is set to `mosaic`, even if those tables were written by Java Paimon or PyPaimon.

This matters for compatibility with the broader Paimon ecosystem, especially for wide-table workloads where Mosaic is the recommended format.

### Solution

Add feature-gated support for reading Mosaic data files in `paimon-rust`, with the implementation split into small reviewable steps.

Proposed phases:

1. **Reader foundation**
   Add an optional `mosaic` feature, depend on `paimon-mosaic-core` behind that feature, and introduce a `MosaicFormatReader` wired into the existing file-format dispatch for `.mosaic` files. This phase should provide basic Arrow `RecordBatch` reading and projection support.
2. **Paimon read-path correctness**
   Integrate the Mosaic reader with the existing table read semantics, including schema evolution, missing-column null filling, projection order, deletion vectors, and row-range selection. This phase should add table-level and mixed-format tests to ensure Mosaic files behave consistently with other supported formats.
3. **Documentation and compatibility polish**
   Document the feature flag, supported scope, and current limitations.
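
To make the phase 1 dispatch step concrete, here is a minimal sketch of extension-based dispatch with a gated `.mosaic` arm. It is not the actual `paimon-rust` code: `DataFileFormat`, `DispatchError`, `format_for_path`, and the `mosaic_enabled` parameter are hypothetical names used only for illustration. In the real crate the gate would more likely be a compile-time `#[cfg(feature = "mosaic")]`; a runtime flag is used here so both outcomes can be exercised in one build. Blob and Vortex are left out because their extensions are not stated in this issue.

```rust
use std::path::Path;

/// Hypothetical format tag; the real crate's types may look different.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataFileFormat {
    Parquet,
    Orc,
    Avro,
    Mosaic,
}

/// Hypothetical error type for the dispatch step.
#[derive(Debug, PartialEq, Eq)]
enum DispatchError {
    /// The extension is known, but the matching feature is not enabled.
    FeatureDisabled(&'static str),
    /// The extension does not map to any known format.
    Unsupported(String),
}

/// Picks a format from the file extension.
///
/// `mosaic_enabled` stands in for the proposed `mosaic` Cargo feature
/// (an assumption of this sketch, see the paragraph above).
fn format_for_path(path: &str, mosaic_enabled: bool) -> Result<DataFileFormat, DispatchError> {
    // Files without an extension fall through to `Unsupported("")`.
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("");

    match ext {
        "parquet" => Ok(DataFileFormat::Parquet),
        "orc" => Ok(DataFileFormat::Orc),
        "avro" => Ok(DataFileFormat::Avro),
        // Recognize `.mosaic` only when the feature is on ...
        "mosaic" if mosaic_enabled => Ok(DataFileFormat::Mosaic),
        // ... and otherwise say which feature is missing instead of
        // reporting the file as an unknown format.
        "mosaic" => Err(DispatchError::FeatureDisabled("mosaic")),
        other => Err(DispatchError::Unsupported(other.to_string())),
    }
}
```

With this shape, `format_for_path("bucket-0/data-0.mosaic", false)` returns `Err(DispatchError::FeatureDisabled("mosaic"))`, which gives users a clearer hint than a generic "unsupported format" error when they read a Mosaic table without the feature enabled.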

The initial scope should be read-only and focused on compatibility with Mosaic files written by Java Paimon, PyPaimon, or `paimon-mosaic-core`.

Out of scope for the first phase:

- Writing `.mosaic` data files from `paimon-rust`.
- Emitting Mosaic row-group statistics into `DataFileMeta.value_stats`.
- Making Mosaic the default file format.
- Implementing Mosaic bloom filter support.
- Changing the Mosaic storage format.

Follow-up work can add writer support, stats integration, and performance benchmarks once read compatibility is stable.

### Anything else?

Relevant context:

- `paimon-rust` already has a format abstraction through `FormatFileReader` and `FormatFileWriter`.
- `paimon-mosaic-core` and `paimon-rust` both currently use Arrow 58, so a pure Rust integration should avoid the JNI/native library loading issues that exist in the Java integration path.
- The first implementation should be conservative and feature-gated because Mosaic is still evolving and has format-specific options such as bucket count, ZSTD level, and stats columns.
- This issue should prioritize ecosystem read compatibility first. Performance optimizations and writer support can be tracked separately after the initial reader is merged.

### Willingness to contribute

- [x] I'm willing to submit a PR!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
