rok commented on PR #45360:
URL: https://github.com/apache/arrow/pull/45360#issuecomment-2651021582

   Thanks for doing this @kszucs ! I like how this doesn't need any changes to 
readers.
   
   Questions:
   - As it stands in this PR, content-defined chunking (CDC) is either on or 
off for all columns. How about enabling it per column? In the general case, 
some columns might not be worthwhile candidates for it.
   - The use case described in the [HF 
blogpost](https://huggingface.co/blog/improve_parquet_dedupe) covers cases 
where rows are added or removed but little else changes. Wouldn't it then 
make sense to first try a shortcut deduplication: if we identify a duplicated 
chunk in the first column, check for the same duplication at the same row 
indices in all other columns before running a full hashing pass?
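
To make the second question more concrete, here is a rough sketch of what I have in mind. It is not based on the code in this PR and uses no Arrow or Parquet API: `chunk_boundaries`, `chunk_digest` and `shortcut_dedupe` are hypothetical names, columns are plain Python lists, and the chunker is a toy that hashes each value on its own instead of using a real rolling hash. It also assumes the previous version's rows are available for direct comparison, which may not hold in practice.

```python
import hashlib


def chunk_boundaries(values, mask=0x3):
    """Toy content-defined chunker (assumption: per-value hash, not a rolling hash).

    A chunk ends after any value whose hash has its low bits equal to zero,
    so boundaries depend only on content, not on row position.
    """
    bounds, start = [], 0
    for i, v in enumerate(values):
        h = int.from_bytes(hashlib.sha256(repr(v).encode()).digest()[:4], "big")
        if h & mask == 0:
            bounds.append((start, i + 1))
            start = i + 1
    if start < len(values):
        bounds.append((start, len(values)))  # trailing partial chunk
    return bounds


def chunk_digest(values):
    """Digest of one chunk's content."""
    return hashlib.sha256(repr(values).encode()).hexdigest()


def shortcut_dedupe(old, new, key):
    """Chunk only the `key` column, then reuse its row ranges for the others.

    Returns a list of (old_range, new_range, others_match) tuples, one per
    chunk of new[key] that also exists in old[key]. `others_match` is True
    when every other column is identical over the same row ranges, in which
    case those columns would not need their own hashing pass.
    """
    old_chunks = {
        chunk_digest(old[key][s:e]): (s, e) for s, e in chunk_boundaries(old[key])
    }
    result = []
    for s, e in chunk_boundaries(new[key]):
        match = old_chunks.get(chunk_digest(new[key][s:e]))
        if match is None:
            continue  # new data in the key column, fall back to full hashing
        os_, oe = match
        others_match = all(
            old[col][os_:oe] == new[col][s:e] for col in new if col != key
        )
        result.append(((os_, oe), (s, e), others_match))
    return result
```

For example, prepending a row to every column shifts all row indices by one, but the matched ranges in the key column still line up with identical data in the other columns. Changing a single value in a non-key column makes only the range covering that row fall back to the full pass.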

