Stig Korsnes created ARROW-15920:
------------------------------------
Summary: Memory usage RecordBatchStreamWriter
Key: ARROW-15920
URL: https://issues.apache.org/jira/browse/ARROW-15920
Project: Apache Arrow
Issue Type: Wish
Affects Versions: 7.0.0
Environment: Windows 11, Python 3.9.2
Reporter: Stig Korsnes
Attachments: demo.py, mem.png
Hi.

I have a Monte Carlo calculator that yields a couple of hundred Nx1 numpy arrays. I need to develop further functionality on it, and since that can't be done easily without access to the full set, I'm pursuing the route of exporting the arrays. I found PyArrow and got excited. The first wall I hit was that the IPC writer could not write "columns". After a Stack Overflow post, and two weeks later, I'm writing each array to its own single-column file with a stream writer, using write_table and its chunk-size parameter (write_batch has no such parameter).
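Roughly what I'm doing per column (a simplified sketch; the column name, file name and sizes are placeholders, the attached demo.py is the real attempt):

{code:python}
import numpy as np
import pyarrow as pa

# Simplified per-column write: one Nx1 numpy array goes into its own
# single-column IPC stream file, split into record batches on write.
arr = np.random.default_rng().standard_normal(10_000_000)  # one simulated column
table = pa.table({"col_0": arr})

with pa.OSFile("col_0.arrow", "wb") as sink:
    with pa.ipc.new_stream(sink, table.schema) as writer:
        # max_chunksize caps the number of rows per record batch
        writer.write_table(table, max_chunksize=1_000_000)
{code}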
I'm then combining all the files into a single file by opening a reader for every file and reading batches from each. I combine those into a single record batch and write it out. The whole idea is that I can later pull in parts of the complete set / all columns (which would fit in memory) and process them further.
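The merge step looks roughly like this (again a simplified sketch, assuming every per-column file holds the same number of batches with the same lengths; file names are placeholders):

{code:python}
import pyarrow as pa

# Simplified merge: read batches from each single-column stream file and
# stitch them into multi-column record batches in a combined stream file.
paths = ["col_0.arrow", "col_1.arrow", "col_2.arrow"]  # placeholder names
sources = [pa.OSFile(p, "rb") for p in paths]
readers = [pa.ipc.open_stream(src) for src in sources]

schema = pa.schema(
    [pa.field(f"col_{i}", r.schema.field(0).type) for i, r in enumerate(readers)]
)

with pa.OSFile("combined.arrow", "wb") as sink:
    with pa.ipc.new_stream(sink, schema) as writer:
        # One batch per column at a time; batches are assumed to line up.
        for batches in zip(*readers):
            combined = pa.RecordBatch.from_arrays(
                [b.column(0) for b in batches], schema=schema
            )
            writer.write_batch(combined)

for src in sources:
    src.close()
{code}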
Now, everything works, but following along in Task Manager I see that memory simply skyrockets when I write. I would expect memory consumption to stay around the size of my group of batches, plus some overhead. The whole point of this exercise is having the data fit in memory, and I cannot see how to achieve this. It makes me wonder whether I'm a complete idiot when I read [efficiently-writing-and-reading-arrow-data|https://arrow.apache.org/docs/python/ipc.html#efficiently-writing-and-reading-arrow-data]: have I done something wrong, or am I looking at it wrong? I have attached a Python file with a simple attempt. I have tried the file writers, using Tables instead of batches, and refactoring in every way I can think of.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)