Stig Korsnes created ARROW-15920:
------------------------------------
Summary: Memory usage RecordBatchStreamWriter
Key: ARROW-15920
URL: https://issues.apache.org/jira/browse/ARROW-15920
Project: Apache Arrow
Issue Type: Wish
Affects Versions: 7.0.0
Environment: Windows 11, Python 3.9.2
Reporter: Stig Korsnes
Attachments: demo.py, mem.png
Hi.

I have a Monte Carlo calculator that yields a couple of hundred Nx1 numpy arrays. I need to develop further functionality on it, and since that can't be done easily without access to the full set, I'm pursuing the route of exporting the arrays. I found PyArrow and got excited. The first wall I hit was that the IPC writer could not write "columns". After a Stack Overflow post, and two weeks later, I'm writing each array to its own single-column file with a stream writer, using write_table and its chunk-size parameter (write_batch has no such parameter).
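Roughly what I'm doing per column (a simplified sketch; the column name, file name and sizes are placeholders, the attached demo.py is the real attempt):

{code:python}
import numpy as np
import pyarrow as pa

# Simplified per-column write: one Nx1 numpy array goes into its own
# single-column IPC stream file, split into record batches on write.
arr = np.random.default_rng().standard_normal(10_000_000)  # one simulated column
table = pa.table({"col_0": arr})

with pa.OSFile("col_0.arrow", "wb") as sink:
    with pa.ipc.new_stream(sink, table.schema) as writer:
        # max_chunksize caps the number of rows per record batch
        writer.write_table(table, max_chunksize=1_000_000)
{code}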
I'm then combining all the files into a single file by opening a reader for every file and reading batches from each. I combine those into a single record batch and write it out. The whole idea is that I can later pull in parts of the complete set / all columns (which would fit in memory) and process them further.
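The merge step looks roughly like this (again a simplified sketch, assuming every per-column file holds the same number of batches with the same lengths; file names are placeholders):

{code:python}
import pyarrow as pa

# Simplified merge: read batches from each single-column stream file and
# stitch them into multi-column record batches in a combined stream file.
paths = ["col_0.arrow", "col_1.arrow", "col_2.arrow"]  # placeholder names
sources = [pa.OSFile(p, "rb") for p in paths]
readers = [pa.ipc.open_stream(src) for src in sources]

schema = pa.schema(
    [pa.field(f"col_{i}", r.schema.field(0).type) for i, r in enumerate(readers)]
)

with pa.OSFile("combined.arrow", "wb") as sink:
    with pa.ipc.new_stream(sink, schema) as writer:
        # One batch per column at a time; batches are assumed to line up.
        for batches in zip(*readers):
            combined = pa.RecordBatch.from_arrays(
                [b.column(0) for b in batches], schema=schema
            )
            writer.write_batch(combined)

for src in sources:
    src.close()
{code}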
Now, everything works, but following along in Task Manager I see that memory simply skyrockets when I write. I would expect memory consumption to stay around the size of my group of batches, plus some overhead. The whole point of this exercise is having the data fit in memory, and I cannot see how to achieve this. It makes me wonder whether I'm a complete idiot when I read [efficiently-writing-and-reading-arrow-data|https://arrow.apache.org/docs/python/ipc.html#efficiently-writing-and-reading-arrow-data]: have I done something wrong, or am I looking at it wrong? I have attached a Python file with a simple attempt. I have tried the file writers, using Tables instead of batches, and refactoring in every way I can think of.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)