Re: [I] Potential memory issue when using COPY with PARTITIONED BY [datafusion]

via GitHub Tue, 16 Jul 2024 19:29:56 -0700


hveiga commented on issue #11042:
URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2232227943


   I finally have some time to continue investigating this issue. I have not 
been able to get heaptrack working (yet!), but I did try 
[dhat](https://docs.rs/dhat/latest/dhat/) and got an interesting lead:
   
   <img width="1505" alt="Screenshot 2024-07-16 at 7 22 55 PM" 
src="https://github.com/user-attachments/assets/0743cf9f-de70-465d-b67f-72090e863bfe">
   
   I won't claim to be experienced with this tool, but I was curious why it was 
highlighting `dict_encoder.rs`. When writing Parquet from a `COPY` query, the 
option `DICTIONARY_ENABLED` is enabled by default. I decided to try disabling 
it with `DICTIONARY_ENABLED false`. 
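
For anyone who wants to run the same comparison, here is a minimal sketch of how the two statement variants might be put together. It assumes the unprefixed `DICTIONARY_ENABLED` option key is accepted inside `OPTIONS (...)`, which may differ between DataFusion versions. The table name, output directory, and partition column are hypothetical placeholders, not the ones from my setup.

```python
def build_copy_statement(source, target_dir, partition_cols, dictionary_enabled):
    """Assemble a COPY ... PARTITIONED BY statement with the dictionary option set explicitly.

    Assumption: the unprefixed DICTIONARY_ENABLED key is accepted in OPTIONS;
    check the COPY documentation for the DataFusion version in use.
    """
    partitions = ", ".join(partition_cols)
    flag = "true" if dictionary_enabled else "false"
    return (
        f"COPY {source} TO '{target_dir}' "
        f"STORED AS PARQUET "
        f"PARTITIONED BY ({partitions}) "
        f"OPTIONS (DICTIONARY_ENABLED {flag})"
    )


# Hypothetical table, path, and partition column, for illustration only.
with_dict = build_copy_statement("my_table", "/tmp/out/", ["col_a"], True)
without_dict = build_copy_statement("my_table", "/tmp/out/", ["col_a"], False)

print(with_dict)
print(without_dict)
```

Running each variant repeatedly in the same process and watching resident memory between invocations should be enough to see whether the two settings behave differently.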
   
   After disabling it, I no longer see the pattern of steadily growing memory: 
usage increases only marginally per invocation (in the 100-200MB range), 
whereas with `DICTIONARY_ENABLED true` each invocation increases memory usage 
by multiple GBs (2-3GB), and that memory never seems to be freed. 
   
   I don't have a root cause yet, but I wanted to share this behavior in case 
somebody else finds the pattern familiar. I also found 
https://github.com/apache/arrow-rs/issues/5828, which might be related. 

