[
https://issues.apache.org/jira/browse/ARROW-15920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stig Korsnes updated ARROW-15920:
---------------------------------
Description:
Hi.
I have a Monte Carlo calculator that yields a couple of hundred Nx1 numpy
arrays. I need to develop further functionality on it, and since that can't
easily be done without access to the full set, I'm pursuing the route of
exporting the arrays. I found PyArrow and got excited. The first wall I hit
was that the writer could not write "columns" (IPC). One Stack Overflow post
and two weeks later, I'm writing each array to its own single-column file with
a stream writer, using write_table and its max_chunksize parameter
(write_batch has no such parameter). I then combine all the files into one by
opening a reader for every file, reading the corresponding "part" batches,
combining each group into a single record batch, and writing it out.
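For context, the per-column write step looks roughly like this (a simplified
sketch; the array names, sizes, and temp-file paths below are placeholders,
the real code is in the attached demo.py):
{code:python}
import numpy as np
import pyarrow as pa

# Placeholder stand-ins for the Monte Carlo output and temp-file paths.
columns = {f"sim_{i}": np.random.rand(1_000_000) for i in range(3)}
tempfiles = [f"col_{name}.arrows" for name in columns]

for path, (name, arr) in zip(tempfiles, columns.items()):
    table = pa.table({name: arr})
    with pa.ipc.new_stream(path, table.schema) as writer:
        # write_table splits the column into record batches of at most
        # max_chunksize rows; write_batch has no equivalent parameter.
        writer.write_table(table, max_chunksize=64_000)
{code}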
The whole idea is that I can later pull in parts of the complete set / all
columns (which would fit in memory) and process them further. Everything
works, but following along in Task Manager I see that memory simply skyrockets
while I write. I would expect memory consumption to stay around the size of my
group batches, plus some overhead. The whole point of this exercise is having
things fit in memory, and I cannot see how to achieve that. It makes me wonder
if I'm a complete idiot when I read
[#efficiently-writing-and-reading-arrow-data]: have I done something wrong, or
am I looking at it wrong? I have attached a Python file with a simple attempt.
I have tried the file writers, using Tables instead of batches, and
refactoring in every way I could think of.
A snip of the combining step:
{code:python}
import os
import pyarrow as pa

# One stream reader per single-column temp file.
readers = [pa.ipc.open_stream(file) for file in self.tempfiles]
combined_schema = pa.unify_schemas([r.schema for r in readers])

with pa.ipc.new_stream(
    os.path.join(self.path, self.outfile_name), schema=combined_schema
) as writer:
    for group in zip(*readers):
        # Stitch one batch from every reader into a single multi-column batch.
        combined_batch = pa.RecordBatch.from_arrays(
            [g.column(0) for g in group], names=combined_schema.names
        )
        writer.write_batch(combined_batch)
{code}
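A diagnostic that might narrow down where the memory goes (a sketch only, not
part of the attached demo.py) is to compare what Arrow's memory pool actually
holds against what Task Manager reports:
{code:python}
import pyarrow as pa

# If total_allocated_bytes() stays near the size of one combined batch
# while the process working set keeps climbing, the batches themselves
# are not leaking; the pool's allocator is retaining freed pages.
print("allocator backend:", pa.default_memory_pool().backend_name)
print("bytes held by Arrow:", pa.total_allocated_bytes())
{code}
A related thought: if the combined output were written with the IPC file
format (pa.ipc.new_file) rather than the stream format, it could later be
memory-mapped with pa.memory_map and pa.ipc.open_file, so pulling parts of the
set back in would not have to copy everything into process memory.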
> Memory usage RecordBatchStreamWriter
> ------------------------------------
>
> Key: ARROW-15920
> URL: https://issues.apache.org/jira/browse/ARROW-15920
> Project: Apache Arrow
> Issue Type: Wish
> Affects Versions: 7.0.0
> Environment: Windows 11, Python 3.9.2
> Reporter: Stig Korsnes
> Priority: Major
> Attachments: demo.py, mem.png
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)