QuakeWang opened a new issue, #378:
URL: https://github.com/apache/paimon-rust/issues/378

   ### Search before asking
   
   - [x] I searched in the 
[issues](https://github.com/apache/paimon-rust/issues) and found nothing 
similar.
   
   
   ### Motivation
   
   Apache Paimon already supports Mosaic, a data file format optimized for 
wide tables. Mosaic is a columnar-bucket hybrid format that groups columns 
into deterministic buckets and compresses each bucket independently, which 
reduces read amplification for very wide tables when queries project only a 
small subset of columns.
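
   To illustrate why bucketing reduces read amplification, here is a minimal 
sketch. The assignment rule (column index modulo bucket count) and the names 
`bucket_of` and `buckets_to_read` are assumptions made for illustration only; 
the real rule is defined by the Mosaic format and may differ.

```rust
use std::collections::BTreeSet;

/// Assign a column to a bucket.
/// Hypothetical rule for illustration: column index modulo bucket count.
/// Panics if `bucket_count` is zero.
fn bucket_of(column_index: usize, bucket_count: usize) -> usize {
    column_index % bucket_count
}

/// Return the distinct buckets that must be decompressed to serve a projection.
fn buckets_to_read(projected_columns: &[usize], bucket_count: usize) -> BTreeSet<usize> {
    projected_columns
        .iter()
        .map(|&column| bucket_of(column, bucket_count))
        .collect()
}
```

   Under this assumed rule, projecting columns 3, 35, and 70 with 32 buckets 
touches only buckets 3 and 6, so the remaining buckets never need to be 
decompressed.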
   
   Currently, `paimon-rust` dispatches data file readers by file extension and 
supports formats such as Parquet, ORC, Avro, Blob, and feature-gated Vortex. 
However, `.mosaic` files are not recognized. As a result, Rust clients cannot 
read Paimon tables whose `file.format` is set to `mosaic`, even if those tables 
were written by Java Paimon or PyPaimon.
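
   As a rough sketch of the gap, extension-based dispatch with a feature-gated 
`.mosaic` arm might look like the following. The `FileFormat` enum and 
`format_for_path` function are hypothetical names for this example and are 
not the actual `paimon-rust` API.

```rust
use std::path::Path;

/// Hypothetical format enum for this sketch; not the actual paimon-rust type.
#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq)]
enum FileFormat {
    Parquet,
    Orc,
    Avro,
    Mosaic,
}

/// Pick a reader format from the file extension (case-insensitive).
fn format_for_path(path: &str) -> Result<FileFormat, String> {
    // Extension without the dot, lowercased; empty when there is none.
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("")
        .to_ascii_lowercase();

    match ext.as_str() {
        "parquet" => Ok(FileFormat::Parquet),
        "orc" => Ok(FileFormat::Orc),
        "avro" => Ok(FileFormat::Avro),
        // Only compiled in when the optional `mosaic` feature is enabled.
        #[cfg(feature = "mosaic")]
        "mosaic" => Ok(FileFormat::Mosaic),
        // Without the feature, report a clear error instead of a generic one.
        #[cfg(not(feature = "mosaic"))]
        "mosaic" => Err("mosaic support requires the `mosaic` feature".to_string()),
        other => Err(format!("unsupported file format: {other:?}")),
    }
}
```

   Returning a dedicated error when the feature is disabled is one possible 
design choice; it tells users how to enable support rather than reporting an 
unknown extension.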
   
   This matters for compatibility with the broader Paimon ecosystem, especially 
for wide-table workloads where Mosaic is the recommended format.
   
   ### Solution
   
   Add feature-gated support for reading Mosaic data files in `paimon-rust`, 
with the implementation split into small reviewable steps.
   
   Proposed phases:
   
   1. **Reader foundation**
   
      Add an optional `mosaic` feature, depend on `paimon-mosaic-core` behind 
that feature, and introduce a `MosaicFormatReader` wired into the existing 
file-format dispatch for `.mosaic` files.
   
      This phase should provide basic Arrow `RecordBatch` reading and 
projection support.
   
   2. **Paimon read-path correctness**
   
      Integrate the Mosaic reader with the existing table read semantics, 
including schema evolution, missing-column null filling, projection order, 
deletion vectors, and row-range selection.
   
      This phase should add table-level and mixed-format tests to ensure Mosaic 
files behave consistently with other supported formats.
   
   3. **Documentation and compatibility polish**
   
      Document the feature flag, supported scope, and current limitations. The 
initial scope should be read-only and focused on compatibility with Mosaic 
files written by Java Paimon, PyPaimon, or `paimon-mosaic-core`.
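
   The read-path semantics listed in phase 2 can be sketched without any 
Arrow dependency. The example below is a simplified model, not the 
`paimon-rust` implementation: columns are plain `Vec<Option<i64>>`, field ids 
are `i32`, and `align_to_projection` and `select_rows` are hypothetical 
helpers. It shows missing-column null filling, projection order, and row 
filtering by a row range combined with deleted positions.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// A nullable column of 64-bit integers; `None` stands for NULL.
type Column = Vec<Option<i64>>;

/// Align the columns stored in a data file to the requested projection.
///
/// `file_columns` maps field id to the column found in the file.
/// `projection` lists field ids in the order the caller expects.
/// A field id absent from the file (for example, a column added later by
/// schema evolution) is filled with `num_rows` NULLs.
fn align_to_projection(
    file_columns: &HashMap<i32, Column>,
    projection: &[i32],
    num_rows: usize,
) -> Vec<Column> {
    projection
        .iter()
        .map(|field_id| match file_columns.get(field_id) {
            Some(column) => column.clone(),
            None => vec![None; num_rows],
        })
        .collect()
}

/// Keep only rows whose position lies inside `range` and is not listed in
/// `deleted`, which models a deletion vector as a slice of row positions.
fn select_rows(column: &Column, range: Range<usize>, deleted: &[usize]) -> Column {
    column
        .iter()
        .enumerate()
        .filter(|(position, _)| range.contains(position) && !deleted.contains(position))
        .map(|(_, value)| *value)
        .collect()
}
```

   A table-level test for Mosaic could assert the same outcomes against real 
files: a projection of a missing field followed by an existing one yields an 
all-NULL column first, and row selection drops deleted positions inside the 
requested range.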
   
   Out of scope for the initial implementation:
   
   - Writing `.mosaic` data files from `paimon-rust`.
   - Emitting Mosaic row-group statistics into `DataFileMeta.value_stats`.
   - Making Mosaic the default file format.
   - Implementing Mosaic bloom filter support.
   - Changing the Mosaic storage format.
   
   Follow-up work can add writer support, stats integration, and performance 
benchmarks once read compatibility is stable.
   
   ### Anything else?
   
   Relevant context:
   
   - `paimon-rust` already has a format abstraction through `FormatFileReader` 
and `FormatFileWriter`.
   - `paimon-mosaic-core` and `paimon-rust` both currently use Arrow 58, so a 
pure Rust integration should avoid the JNI/native library loading issues that 
exist in the Java integration path.
   - The first implementation should be conservative and feature-gated because 
Mosaic is still evolving and has format-specific options such as bucket count, 
ZSTD level, and stats columns.
   - This issue should prioritize ecosystem read compatibility first. 
Performance optimizations and writer support can be tracked separately after 
the initial reader is merged.
   
   
   ### Willingness to contribute
   
   - [x] I'm willing to submit a PR!

