GitHub user zheniasigayev added a comment to the discussion: Best practices for memory-efficient deduplication of pre-sorted Parquet files
> If you could make a reproducer with synthetic data and file a ticket I would be happy to look into this further

I created a public Gist which you can find here: https://gist.github.com/zheniasigayev/2e5e471c9070cfa685d938bced47aa7f.

I confirmed that the [2 queries](https://github.com/apache/datafusion/discussions/16776#discussioncomment-13780110) that I provided in the discussion above produced the same query plan, and memory consumers, when run against the generated parquet files.

GitHub link: https://github.com/apache/datafusion/discussions/16776#discussioncomment-13837673

----
This is an automatically sent email for github@datafusion.apache.org.
To unsubscribe, please send an email to: github-unsubscr...@datafusion.apache.org