kszucs opened a new pull request, #9661:
URL: https://github.com/apache/arrow-rs/pull/9661
# Which issue does this PR close?
This PR does not close an existing issue — it is a prototype opened for
early discussion.
# Rationale for this change
Storing multiple versions of a dataset is expensive. Page deduplication based
on content-defined chunking (CDC) can eliminate most of that redundancy with
no special storage backend required.
# What changes are included in this PR?
- `parquet::arrow::page_store` — `PageStoreWriter` and `PageStoreReader`
- Writer re-encodes pages using CDC and writes each as a `{blake3}.page` blob
  into a shared store directory; identical pages across files are stored once.
- Reader reassembles data from a lightweight manifest-only Parquet file.
- `parquet-page-store` CLI (`page_store,cli` features): `write`, `read`,
  `reconstruct`
- `parquet/examples/page_store_dedup/` — end-to-end demo on real data
  (OpenHermes-2.5)
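The core idea of the writer can be sketched as a content-addressed blob store: each page is keyed by a hash of its bytes, so identical pages map to the same blob. This is a minimal in-memory sketch, not the PR's implementation — it uses `std`'s `DefaultHasher` as a stand-in for blake3 and a `HashMap` as a stand-in for the on-disk store directory; `PageStore`, `put`, and `unique_pages` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy content-addressed store: key = hash of the page bytes
/// (a stand-in for the `{blake3}.page` blob naming in the PR).
struct PageStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl PageStore {
    fn new() -> Self {
        Self { blobs: HashMap::new() }
    }

    /// Store a page and return its content key; an identical page
    /// (from this file or any other) is stored only once.
    fn put(&mut self, page: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        page.hash(&mut h);
        let key = h.finish();
        self.blobs.entry(key).or_insert_with(|| page.to_vec());
        key
    }

    fn unique_pages(&self) -> usize {
        self.blobs.len()
    }
}

fn main() {
    let mut store = PageStore::new();
    // Two "files" that share one page: the shared page dedups to one blob,
    // and each file keeps only a manifest of content keys.
    let file_a = [b"page-1".as_slice(), b"page-2".as_slice()];
    let file_b = [b"page-2".as_slice(), b"page-3".as_slice()];
    let manifest_a: Vec<u64> = file_a.iter().map(|p| store.put(p)).collect();
    let manifest_b: Vec<u64> = file_b.iter().map(|p| store.put(p)).collect();
    assert_eq!(store.unique_pages(), 3); // 4 pages written, 3 blobs stored
    assert_eq!(manifest_a[1], manifest_b[0]); // shared page, same key
    println!("stored {} unique pages", store.unique_pages());
}
```

The manifests here play the role of the PR's manifest-only Parquet file: the reader follows the keys back into the shared store to reassemble the data.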
On four variants of an 800 MB dataset (filtered, augmented, appended): 3.1
GB → 563 MB (82% reduction, 5.6×).
# Are these changes tested?
Yes. Tests cover round-trips, multi-page and multi-row-group files, nested
types, cross-file dedup, page integrity checks, and reader error cases.
# Are there any user-facing changes?
Additive only, gated behind the `page_store` feature flag (off by default).
The API and manifest format are explicitly unstable in this PR.
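Since the feature is off by default, trying the prototype would mean enabling it explicitly; a hypothetical `Cargo.toml` fragment (the version placeholder is illustrative, only the `page_store` feature name comes from this PR):

```toml
[dependencies]
parquet = { version = "<unreleased>", features = ["page_store"] }
```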
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]