amol- commented on a change in pull request #10266: URL: https://github.com/apache/arrow/pull/10266#discussion_r631870514
########## File path: docs/source/python/memory.rst ########## @@ -277,6 +277,95 @@ types than with normal Python file objects. !rm example.dat !rm example2.dat +Efficiently Writing and Reading Arrow Arrays +-------------------------------------------- + +Being optimized for zero copy and memory mapped data, Arrow allows to easily +read and write arrays consuming the minimum amount of resident memory. + +When writing and reading raw arrow data, we can use the Arrow File Format +or the Arrow Streaming Format. + +To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file` +which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance +that can be used to write batches of data to that file. + +For example to write an array of 100M integers, we could write it in 1000 chunks +of 100000 entries: + +.. ipython:: python + + BATCH_SIZE = 100000 + NUM_BATCHES = 1000 + + schema = pa.schema([pa.field('nums', pa.int32())]) + + with pa.OSFile('bigfile.arrow', 'wb') as sink: + with pa.ipc.new_file(sink, schema) as writer: + for row in range(NUM_BATCHES): + batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema) + writer.write(batch) + +record batches support multiple columns, so in practice we always write the +equivalent of a :class:`~pyarrow.Table`. + +Writing in batches is effective because we in theory need to keep in memory only +the current batch we are writing. But when reading back, we can be even more effective +by directly mapping the data from disk and avoid allocating any new memory on read. + +Under normal conditions, reading back our file will consume a few hundred megabytes +of memory: + +.. ipython:: python + + with pa.OSFile('bigfile.arrow', 'rb') as source: + loaded_array = pa.ipc.open_file(source).read_all() + + print("LEN:", len(loaded_array)) + print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20)) + +To more efficiently read big data from disk, we can memory map the file, so that +the array can directly reference the data on disk and avoid copying it to memory. +In such case the memory consumption is greatly reduced and it's possible to read +arrays bigger than the total memory Review comment: I see what you mean. What I was trying to say is that Arrow doesn't have to allocate memory itself as it can directly point to the memory mapped buffer which allocation is managed by the system. Also the memory mapped buffer can be paged out more easily by the system without write back cost as it's not flagged as dirty memory, thus allowing to deal with files bigger than memory even in the absence of a swap file. I'll try to rephrase this in a less misleading way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org