hveiga commented on issue #11042: URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2232227943
I finally have some time to continue investigating this issue. I have not been able to get heaptrack working (yet!), but I did try [dhat](https://docs.rs/dhat/latest/dhat/) and got an interesting lead:

<img width="1505" alt="Screenshot 2024-07-16 at 7 22 55 PM" src="https://github.com/user-attachments/assets/0743cf9f-de70-465d-b67f-72090e863bfe">

I won't claim to be experienced with this tool, but I was curious why it was highlighting `dict_encoder.rs`. When writing Parquet from a `COPY` query, the `DICTIONARY_ENABLED` option is on by default, so I tried disabling it with `DICTIONARY_ENABLED false`. With it disabled I no longer see the pattern of ever-increasing memory: usage grows only marginally on each invocation (in the 100-200 MB range). With `DICTIONARY_ENABLED true`, each invocation increases memory usage by 2-3 GB, and that memory never seems to be freed.

I don't have a root cause yet, but I wanted to share this behavior in case the pattern looks familiar to somebody else. I also found https://github.com/apache/arrow-rs/issues/5828, which might be related.
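
For anyone who wants to check the per-invocation growth without a profiler, here is a minimal sketch of a counting global allocator that uses only the Rust standard library. It is not what produced the numbers in this comment (those came from watching the process and from dhat). The names `CountingAlloc`, `live_bytes`, and `heap_delta` are hypothetical helpers made up for illustration, not part of DataFusion or arrow-rs.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Bytes currently allocated through the global allocator and not yet freed.
static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper around the system allocator that tracks live bytes.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
    }
    // `realloc` and `alloc_zeroed` fall back to the default implementations,
    // which go through `alloc`/`dealloc` above, so the counter stays consistent.
}

#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

/// Current number of live heap bytes, process-wide.
fn live_bytes() -> usize {
    LIVE_BYTES.load(Ordering::Relaxed)
}

/// Runs `work` and returns its result together with the change in live heap
/// bytes across the call. Allocations made by other threads during the call
/// are included, so treat the delta as approximate.
fn heap_delta<T>(work: impl FnOnce() -> T) -> (T, i64) {
    let before = live_bytes() as i64;
    let out = work();
    let after = live_bytes() as i64;
    (out, after - before)
}
```

Wrapping each `COPY` invocation in `heap_delta` and logging the result would show whether live heap bytes keep climbing between runs. If the live count returns to its baseline while the process RSS stays high, the growth is more likely memory retained by the allocator than objects that are still reachable.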