westonpace commented on issue #36303: URL: https://github.com/apache/arrow/issues/36303#issuecomment-1609808639
Hmm, there is no magic in `write_dataset`. We do not figure out partition values ahead of time, and we don't have any clever pocket to store things :) However, since you are not setting `min_rows_per_group`, I would not expect `write_dataset` to do any buffering. Each batch that arrives is partitioned, and the resulting groups are then immediately written as row groups to the parquet files (this could be bad for read performance, by the way, if the row groups are small).

I don't know why you aren't seeing the files immediately. It's possible that Linux is simply hiding the files until the file descriptor is closed?

How are you measuring RAM? One unfortunate fact is that "writes" don't actually write to the disk immediately. They simply copy the memory from user space to the kernel page cache (and mark the page dirty). Those pages get flushed to the disk eventually (e.g. when there is memory pressure), but I believe that can even happen after the file is closed.

Unfortunately, I don't know how to tell Linux to only use a portion of RAM for disk caching. By default it aggressively tries to use every free byte of RAM before it starts pausing write calls. So I would expect a dataset write to fully consume RAM no matter what settings you try. However, it (hopefully) shouldn't crash, since the kernel should realize that all of that dirty page-cache memory is "freeable". This is why I'm curious how you are measuring RAM.

There was an attempt to add direct I/O a while back, but it fell through and I'm not sure what its status is. Direct I/O would alleviate memory pressure because write calls would become blocking and would not return until the data was persisted to the disk.
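As an aside, the kernel does expose writeback thresholds that bound how much *dirty* (not-yet-flushed) data it will accumulate before throttling writers. A config sketch, with purely illustrative values, not recommendations:

```
# /etc/sysctl.d/99-writeback.conf  (illustrative values)
vm.dirty_background_bytes = 268435456   # start background writeback at 256 MiB dirty
vm.dirty_bytes = 1073741824             # pause write() callers once 1 GiB is dirty
```

These only cap dirty pages, not the total page cache, so RAM will still look "full" in naive measurements, but the amount of unflushed data (and the flush stall at close/fsync time) becomes bounded.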
