Stig Korsnes created ARROW-15920:
------------------------------------

             Summary: Memory usage RecordBatchStreamWriter
                 Key: ARROW-15920
                 URL: https://issues.apache.org/jira/browse/ARROW-15920
             Project: Apache Arrow
          Issue Type: Wish
    Affects Versions: 7.0.0
         Environment: Windows 11, Python 3.9.2
            Reporter: Stig Korsnes
         Attachments: demo.py, mem.png

Hi.

I have a Monte Carlo calculator that yields a couple of hundred Nx1 numpy 
arrays. I need to develop further functionality on it, and since that can't be 
solved easily without access to the full set, I'm pursuing the route of 
exporting them. I found PyArrow and got excited. The first wall I hit was that 
the writer could not write "columns" (IPC). One Stack Overflow post and two 
weeks later, I'm writing each array to its own single-column file with a 
stream writer, using write_table and chunksize (write_batch has no such 
parameter). I then combine all the files into a single file by opening a 
reader for every file and reading batches: I combine the batches into a single 
record batch and write it. The whole idea is that I can later pull in parts of 
the complete set, across all columns (which would fit in memory), and process 
further.

Now, everything works, but following along in Task Manager, I see that memory 
simply skyrockets when I write. I would expect memory consumption to stay 
around the size of my group of batches and then some. The whole point of this 
exercise is having stuff fit in memory, and I cannot see how to achieve this. 
It makes me wonder if I'm a complete idiot when I read 
[efficiently-writing-and-reading-arrow-data|https://arrow.apache.org/docs/python/ipc.html#efficiently-writing-and-reading-arrow-data]. 
Have I done something wrong, or am I looking at it wrong? I have attached a 
Python file with a simple attempt. I have tried the file writers, using Tables 
instead of batches, and refactoring in every way I can think of.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
