StuartHadfield commented on issue #36303: URL: https://github.com/apache/arrow/issues/36303#issuecomment-1610773504
@westonpace - thanks for your response :) That makes me feel better - I couldn't for the life of me find anything that suggested magic!

My observations, though somewhat early-stage and ill-formed, are as follows:

- Max row counts above ~75k end up with a container with a 2 GB memory limit being OOM-killed, and my write process doesn't complete :(
- I've managed to process large files with a 50k row limit, which seemed to top out memory usage at 1.3 GB. (I guess this is because it's forced to flush files before it can read enough rows into memory to OOM the container?)
- The container's memory usage rapidly approaches and hovers at 1 GB, then increases by maybe 100 MB before I start writing files.
- My `file_visitor` function is called, which logs to stdout that I've written file(s).
- Memory usage seems to drop in correlation with the file writes being logged, a ~100 MB drop.
- The pattern repeats (memory usage increases, a bunch of files are written, memory usage drops).

The write occurs in a container within a pod in a Kubernetes cluster, so I've just been measuring memory usage by calling `kubectl top pod --containers` in a loop. I know this isn't a replacement for the standalone Linux tools, but it proved enough of a point to me that memory usage approached my hard limit before my pod disappeared (I don't have logging on an OOM kill yet).

> Each batch that arrives will be partitioned

Is this in reference specifically to each RecordBatch that arrives at the `write_dataset` function?

> I don't know why you aren't seeing the files immediately

My assumption was that pyarrow keeps the files open until it has accumulated 500k rows per file.

> These pages will get flushed to the disk eventually (e.g. when there is memory pressure) but that could even happen after the file is closed I believe.

Right, that makes sense. Is it possible that the flush is not happening fast enough? I.e. the part that is reading rows into memory happens "too fast" for us to flush to disk and relieve some of that memory pressure?

> However, it (hopefully) shouldn't crash as it should realize that all of that dirty disk RAM is "freeable". 😄

That'd be all too convenient.
