hveiga commented on issue #11042:
URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2232227943

   I finally have some time to continue investigating this issue. I have not 
been able to make heaptrack work (yet!) but I did try using 
[dhat](https://docs.rs/dhat/latest/dhat/) and I got an interesting lead:
   
   <img width="1505" alt="Screenshot 2024-07-16 at 7 22 55 PM" 
src="https://github.com/user-attachments/assets/0743cf9f-de70-465d-b67f-72090e863bfe">
   
   I won't claim I am experienced with this tool, but I was curious why it was 
highlighting `dict_encoder.rs`. When writing Parquet from a `COPY` query, the 
`DICTIONARY_ENABLED` option is on by default, so I tried disabling it with 
`DICTIONARY_ENABLED false`. 
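
   For anyone who wants to try the same comparison, here is a minimal sketch of 
building the `COPY` statement with the dictionary option toggled. The helper 
name `copy_to_parquet_sql` is hypothetical, and the clause layout 
(`STORED AS PARQUET OPTIONS (...)`) is an assumption; only the 
`DICTIONARY_ENABLED false` spelling comes from the comment above, and the exact 
key name may differ between DataFusion versions.

   ```rust
   /// Build a `COPY (...) TO ...` statement that writes Parquet output.
   /// Hypothetical helper for illustration; it is not part of DataFusion.
   fn copy_to_parquet_sql(source: &str, path: &str, dictionary_enabled: bool) -> String {
       // Double any single quotes so the path stays a valid SQL string literal.
       let path = path.replace('\'', "''");
       // Assumed clause layout; the option spelling follows the comment above.
       format!(
           "COPY ({source}) TO '{path}' STORED AS PARQUET OPTIONS (DICTIONARY_ENABLED {dictionary_enabled})"
       )
   }
   ```

   Calling it with `false` as the last argument yields the statement to pass to 
the query engine for the run without dictionary encoding; calling it with `true` 
gives the default behavior to compare against.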
   
   After disabling it, I no longer see the pattern of steadily increasing 
memory: usage grows only marginally with each invocation (in the 100-200MB 
range). With `DICTIONARY_ENABLED true`, each invocation increases memory usage 
by multiple GBs (2-3GB), and that memory never seems to be freed. 
   
   I don't have a root cause for the issue yet, but I wanted to share this 
behavior in case somebody else finds the pattern familiar. I also found 
https://github.com/apache/arrow-rs/issues/5828, which might be related. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

