StuartHadfield commented on issue #36303:
URL: https://github.com/apache/arrow/issues/36303#issuecomment-1610773504

   @westonpace - thanks for your response :) That makes me feel better - I couldn't for the life of me find anything that suggested magic!
   
   My observations, though somewhat early stages and ill-formed, are as follows:
   - Max row counts above ~75k end with my 2GB container being OOM-killed, so the write process never completes :(
   - I've managed to process large files with a 50k row limit, which seemed to 
top out memory usage at 1.3GB. (I guess this is because it's forced to flush 
files before it can read enough rows into memory to OOM the container?)
   - Container memory usage rapidly climbs to ~1GB and hovers there, then increases by maybe another 100MB before files start being written
   - My `file_visitor` function is called, which logs to stdout that I've 
written file(s)
   - Memory usage seems to drop (by roughly 100MB) in step with the file writes being logged
   - Pattern repeats (mem usage increase, a bunch of files are written, memory 
usage drops).
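   For reference, here's a minimal sketch of the kind of call I'm making (the column names and hive partitioning are illustrative, not my actual schema), with `max_rows_per_file`/`max_rows_per_group` capping file size and a `file_visitor` logging each finished file:

   ```python
   import tempfile

   import pyarrow as pa
   import pyarrow.dataset as ds

   written = []

   def visitor(written_file):
       # Called once per finished file; path and parquet metadata are available.
       written.append(written_file.path)

   # Illustrative data only - two partition values, 100k rows each.
   table = pa.table({
       "part": [i % 2 for i in range(200_000)],
       "value": list(range(200_000)),
   })

   out_dir = tempfile.mkdtemp()
   ds.write_dataset(
       table,
       out_dir,
       format="parquet",
       partitioning=ds.partitioning(pa.schema([("part", pa.int64())]), flavor="hive"),
       max_rows_per_file=50_000,   # the 50k cap that kept my memory bounded
       max_rows_per_group=50_000,  # row groups can't exceed the file cap
       file_visitor=visitor,
   )
   print(f"{len(written)} files written")
   ```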
   
   The write occurs in a container within a pod in a Kubernetes cluster, so I've been measuring memory usage by calling `kubectl top pod --containers` in a loop. I know this isn't a replacement for the standalone Linux tools - but it was enough to show me that memory usage approached my hard limit before the pod disappeared (I don't have logging on an OOM kill yet).
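   For what it's worth, a cruder complement to `kubectl top` would be sampling the process's own peak RSS from inside the container - a minimal sketch using the stdlib `resource` module (Unix-only; I haven't wired this into my job yet):

   ```python
   import resource

   # ru_maxrss is the peak resident set size of this process:
   # kilobytes on Linux, bytes on macOS.
   peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
   print(f"peak RSS so far: {peak_kb} kB")
   ```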
   
   >  Each batch that arrives will be partitioned
   
   Is this in reference specifically to each RecordBatch that arrives to the 
`write_dataset` function?
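   For context, I'm feeding `write_dataset` something like this (a toy sketch, not my real pipeline) - an iterable of RecordBatches with an explicit `schema` - so I'd assume each yielded batch is what gets partitioned as it arrives:

   ```python
   import tempfile

   import pyarrow as pa
   import pyarrow.dataset as ds

   schema = pa.schema([("key", pa.int64()), ("value", pa.float64())])

   def batches():
       # Yield small RecordBatches; write_dataset consumes them one at a
       # time, so only the current batch (plus any open-file buffers)
       # needs to be in memory.
       for i in range(10):
           yield pa.record_batch(
               [pa.array([i] * 1_000), pa.array([float(i)] * 1_000)],
               schema=schema,
           )

   out_dir = tempfile.mkdtemp()
   # schema= is required when passing an iterable rather than a Table.
   ds.write_dataset(batches(), out_dir, schema=schema, format="parquet")
   ```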
   
   >  I don't know why you aren't seeing the files immediately
   
   My assumption was that pyarrow keeps the files open until it has accumulated 500k rows per file.
   
   > These pages will get flushed to the disk eventually (e.g. when there is 
memory pressure) but that could even happen after the file is closed I believe.
   
   Right, that makes sense. Is it possible that the flush isn't happening fast enough? I.e., the part that reads rows into memory happens "too fast" for the flush to disk to relieve some of said memory pressure?
   
   >  However, it (hopefully) shouldn't crash as it should realize that all of 
that dirty disk RAM is "freeable". 
   
   😄 That'd be all too convenient


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
