westonpace commented on issue #36303: URL: https://github.com/apache/arrow/issues/36303#issuecomment-1609808639
Hmm, there is no magic in `write_dataset`. We do not figure out partition values ahead of time, and we don't have any clever pocket to store things :) However, since you are not setting `min_rows_per_group`, I would not expect `write_dataset` to do any buffering. Each batch that arrives is partitioned, and the resulting groups are then immediately written as row groups to the parquet files (this could be bad for read performance, by the way, if the row groups are small).

I don't know why you aren't seeing the files immediately. It's possible that Linux is simply hiding the files until the file descriptor is closed?

How are you measuring RAM? One unfortunate fact is that "writes" don't actually write to the disk immediately. They simply copy the memory from user space to the kernel page cache (and mark the page dirty). Those pages get flushed to the disk eventually (e.g. when there is memory pressure), but I believe that can even happen after the file is closed.

Unfortunately, I don't know how to tell Linux to only use a portion of RAM for disk caching. By default it aggressively tries to use every free byte of RAM before it starts pausing write calls. So I would expect a dataset write to fully consume RAM no matter what settings you try. However, it (hopefully) shouldn't crash, since the kernel should realize that all of that dirty page-cache memory is "freeable". This is why I'm curious how you are measuring RAM.

There was an attempt to add direct I/O a while back, but it fell through and I'm not sure what its status is. Direct I/O would alleviate memory pressure because write calls would become blocking and would not return until the data was persisted to the disk.
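As an aside, the kernel does expose writeback thresholds that bound how much *dirty* (not-yet-flushed) data it will accumulate before throttling writers. A config sketch, with purely illustrative values, not recommendations:

```
# /etc/sysctl.d/99-writeback.conf  (illustrative values)
vm.dirty_background_bytes = 268435456   # start background writeback at 256 MiB dirty
vm.dirty_bytes = 1073741824             # pause write() callers once 1 GiB is dirty
```

These only cap dirty pages, not the total page cache, so RAM will still look "full" in naive measurements, but the amount of unflushed data (and the flush stall at close/fsync time) becomes bounded.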
