JingsongLi opened a new pull request, #45:
URL: https://github.com/apache/paimon-mosaic/pull/45

   Fix a non-determinism bug in the BPE vocabulary builder: HashMap iteration 
order could produce different output bytes for identical input data when column 
names share common substrings. The fix uses the pair key as a deterministic 
tie-breaker when multiple pairs have equal frequency.
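
The tie-breaking idea can be sketched as follows. This is a minimal illustration, not the code from this PR: the `Pair` alias, the `best_pair` function, and the choice of "lexicographically smallest key wins" are assumptions made for the example. The description above only states that the pair key is used as the tie-breaker.

```rust
use std::collections::HashMap;

/// Hypothetical representation of a merge candidate: two adjacent tokens.
type Pair = (String, String);

/// Select the next pair to merge.
///
/// Frequency is compared first. When two pairs have the same frequency,
/// the comparison falls back to the pair key itself (assumed here:
/// the lexicographically smallest key wins), so the result no longer
/// depends on the order in which the HashMap yields its entries.
fn best_pair(counts: &HashMap<Pair, u64>) -> Option<&Pair> {
    counts
        .iter()
        // Higher count is "greater"; on equal counts the smaller key is
        // treated as "greater" by reversing the key comparison.
        .max_by(|(ka, ca), (kb, cb)| ca.cmp(cb).then_with(|| kb.cmp(ka)))
        .map(|(pair, _)| pair)
}
```

Without the `then_with` fallback, `max_by` would return whichever of the equally frequent pairs the map happened to yield last, which can differ between runs because Rust's default `HashMap` hasher is randomly seeded. Since keys in a map are unique, the combined ordering never reports two entries as equal, so the selection is fully determined by the data.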
   
   Add 150+ new tests across Rust, Java, and Python covering:
   - Massive data volumes (up to 10M rows, 1000 columns)
   - 20 data pattern varieties (sawtooth, fibonacci, unicode, sparse, etc.)
   - Encoding/compression/projection interactions (30 tests)
   - Determinism and re-roundtrip stability (18 tests)
   - Fuzz robustness with corrupted input (12 tests)
   - Concurrent multi-threaded reading (5 tests)
   - Binary format byte-level verification (10 tests)
   - Statistics min/max/null_count accuracy (11 tests)
   - Cross-language Rust/Java/Python interoperability (25 tests)

