Hello,
The Hugging Face developers published this insightful blog post about their attempts to deduplicate Parquet files when they have similar contents. They offer a couple suggestions for improvement at the end:

https://huggingface.co/blog/improve_parquet_dedupe

Regards

Antoine.