kszucs commented on PR #45360:
URL: https://github.com/apache/arrow/pull/45360#issuecomment-2679634013

   > Thanks for doing this @kszucs ! I like how this doesn't need any changes 
to readers.
   
   Thanks @rok for taking a look!
   
   > 
   > Questions:
   > 
   > * As it stands in this PR, CDC is either on or off for all columns. How 
about enabling it per column? In general case some columns might not be worthy 
candidates for it.
   
   Yes, it is either on or off for all the columns. The main disadvantage of 
enabling content-defined chunking for all columns is a small performance 
overhead, but from the storage system's perspective it is better to have the 
entire Parquet file optimized for deduplication, which saves disk space and 
network bandwidth.
   
   > * Use case described in [HF 
blogpost](https://huggingface.co/blog/improve_parquet_dedupe) describes cases 
where rows are added or removed but not much else is changed. Wouldn't it then 
make sense to first try a shortcut deduplication where if we identify a 
duplication in the first column we first check for the same duplication at the 
same indices in all other columns before running a full hashing pass?
   
   The CDC process doesn't identify duplicates in the Parquet file itself; it 
only ensures that the columns are chunked consistently based on the data 
stream rather than on a fixed number of records (or a fixed page size in 
bytes). The actual deduplication is done by the storage system to which these 
deduplication-optimized Parquet files are uploaded. A very simple in-memory 
content-addressable storage system implementation is available in the 
[evaluation tool I have been 
using](https://github.com/kszucs/de/blob/main/src/store.rs).
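
   To illustrate the split between chunking and deduplication, here is a 
minimal Python sketch. It is not the implementation from this PR or from the 
linked tool: the gear-style rolling hash, the `MASK`, `MIN_SIZE` and `MAX_SIZE` 
values, and the names `cdc_chunks`, `fixed_chunks` and `ChunkStore` are all 
assumptions chosen for illustration. The chunker only decides where to cut; 
the store is what notices that a chunk already exists.

```python
import hashlib
import random

# Hypothetical parameters for illustration only, not taken from the PR.
MASK = (1 << 6) - 1   # cut when the low 6 bits of the hash are zero
MIN_SIZE = 16         # never cut before this many bytes
MAX_SIZE = 256        # always cut at this many bytes

# Fixed pseudo-random table mapping each byte value to a 32-bit number.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at boundaries that depend on the content, not on offsets."""
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Gear-style rolling hash: older bytes are shifted out over time.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Baseline: split at fixed byte offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Toy in-memory content-addressable store keyed by SHA-256."""

    def __init__(self):
        self.blobs = {}

    def put(self, chunks) -> int:
        """Store chunks and return how many bytes were actually new."""
        new_bytes = 0
        for chunk in chunks:
            key = hashlib.sha256(chunk).hexdigest()
            if key not in self.blobs:
                self.blobs[key] = chunk
                new_bytes += len(chunk)
        return new_bytes


if __name__ == "__main__":
    rng = random.Random(1)
    original = bytes(rng.getrandbits(8) for _ in range(20_000))
    # Simulate inserting a few bytes near the start of the stream.
    edited = original[:100] + b"INSERTED!!" + original[100:]
    for name, chunker in [("fixed", fixed_chunks), ("cdc", cdc_chunks)]:
        store = ChunkStore()
        store.put(chunker(original))
        print(name, "new bytes after edit:", store.put(chunker(edited)))
```

   With fixed offsets, an insertion shifts every later chunk, so the store 
sees almost everything as new. With content-defined boundaries, the cuts tend 
to realign shortly after the edit, so most chunks hash to keys the store 
already holds.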
   
   

