u3Izx9ql7vW4 commented on issue #43929:
URL: https://github.com/apache/arrow/issues/43929#issuecomment-2325536409
That's correct, I'm writing to IPC with memory-mapped files. I did go over
the page you linked a few times but couldn't figure out the difference between
`new_file` and `new_stream`, as they both accept a `NativeFile`. I needed to
have multiple IPC streams running, so I opted for memory-mapped files, which
let me designate a file path for each producer. Could this be done with
Arrow's memory buffer somehow?
> If you just want to write to disk and keep the file memory mapped it's
likely easier (and faster) to just write an arrow file to disk and mmap it
after.
I don't quite follow this part. Isn't this what I'm already doing? Perhaps
you're suggesting that I check whether the memory-mapped file already exists
before creating a new one, like below?
```python
def save_data():
    # NOTE: buffer size alone may undercount the final IPC file (metadata).
    size = table.get_total_buffer_size()
    file_path = os.path.join(prefix_stream, sink)
    if not os.path.exists(file_path):
        pa.create_memory_map(file_path, size)
    # Renamed from `sink` to avoid shadowing the path component above.
    with pa.memory_map(file_path, 'wb') as mm_sink:
        with pa.ipc.new_file(mm_sink, table.schema) as writer:
            writer.write_table(table, max_chunksize=1000)
```
> edit: actually the offset is likely metadata, maybe?
That's what I thought as well, but I would have expected
`get_total_buffer_size` to include the metadata. Then again, I may have
dictionary-encoded columns, so maybe that's inflating the metadata. Do you
know if there's a way to find out?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]