kszucs commented on PR #45360: URL: https://github.com/apache/arrow/pull/45360#issuecomment-2679634013
> Thanks for doing this @kszucs ! I like how this doesn't need any changes to readers.

Thanks @rok for taking a look!

> Questions:
>
> * As it stands in this PR, CDC is either on or off for all columns. How about enabling it per column? In general case some columns might not be worthy candidates for it.

Yes, it is either on or off for all columns. The main disadvantage of enabling content defined chunking for every column is a small performance overhead, but from the storage system's perspective it is better to have the entire Parquet file optimized for deduplication in order to save disk space and network bandwidth.

> * Use case described in [HF blogpost](https://huggingface.co/blog/improve_parquet_dedupe) describes cases where rows are added or removed but not much else is changed. Wouldn't it then make sense to first try a shortcut deduplication where if we identify a duplication in the first column we first check for the same duplication at the same indices in all other columns before running a full hashing pass?

The CDC process itself doesn't identify duplications in the Parquet file; it only ensures that the columns are chunked consistently based on the content of the data rather than on a fixed number of records (or a fixed page size in bytes). The actual deduplication is done by the storage system that these deduplication-optimized Parquet files are uploaded to. A very simple in-memory content addressable storage implementation is available in the [evaluation tool I have been using](https://github.com/kszucs/de/blob/main/src/store.rs).
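For illustration, here is a minimal sketch of the general idea behind content defined chunking (a FastCDC-style gear hash in Python). This is not the chunker implemented in this PR; the gear table, mask, and size limits below are arbitrary. The point is only that boundaries depend on the bytes themselves, so an insertion perturbs the chunks around the edit and the rest resynchronize:

```python
# Toy sketch only (not the implementation in this PR): a gear-hash based
# content defined chunker. Chunk boundaries depend on the bytes themselves,
# so an insertion only perturbs the chunks near the edit instead of shifting
# every subsequent fixed-size chunk.
import hashlib
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random table
MASK = (1 << 13) - 1               # ~8 KiB average chunk size (arbitrary choice)
MIN_SIZE, MAX_SIZE = 2_048, 65_536  # arbitrary chunk size limits

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content defined chunks."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)

def chunk_digests(data: bytes) -> set:
    return {hashlib.sha256(data[s:e]).hexdigest() for s, e in chunk_boundaries(data)}

base = bytes(random.getrandbits(8) for _ in range(200_000))
edited = base[:50_000] + b"a few inserted rows" + base[50_000:]
shared = chunk_digests(base) & chunk_digests(edited)
print(f"{len(shared)} chunks are byte-identical after the insertion")
```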

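And a minimal sketch of such a content addressable store, as an analogue of the idea rather than a port of the linked `store.rs`: identical chunks are stored once, keyed by their digest, which is where the actual space savings come from.

```python
# Minimal sketch of a content addressable store; an illustration only, not a
# port of the linked store.rs. Identical chunks are stored once, keyed by
# their digest, so repeated chunks cost no extra space.
import hashlib

class ContentAddressableStore:
    def __init__(self):
        self._blobs = {}                    # digest -> chunk bytes

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        self._blobs.setdefault(key, chunk)  # storing a duplicate is a no-op
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    @property
    def stored_bytes(self) -> int:
        return sum(len(b) for b in self._blobs.values())

store = ContentAddressableStore()
keys = [store.put(b"chunk-a"), store.put(b"chunk-b"), store.put(b"chunk-a")]
assert store.get(keys[0]) == b"chunk-a"
print(f"3 chunks written, {len(set(keys))} stored, {store.stored_bytes} bytes used")
```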