amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r668602824
##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
!rm example.dat
!rm example2.dat
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero-copy and memory-mapped data, Arrow makes it easy to
+read and write arrays while consuming a minimal amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
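+As a brief aside, the streaming format works in much the same way; here is a
+minimal sketch (assuming :func:`~pyarrow.ipc.new_stream` and
+:func:`~pyarrow.ipc.open_stream`, and using an in-memory sink purely for
+illustration), while the rest of this section focuses on the file format:
+
+.. ipython:: python
+
+    sink = pa.BufferOutputStream()
+    stream_schema = pa.schema([pa.field('nums', pa.int32())])
+    with pa.ipc.new_stream(sink, stream_schema) as stream_writer:
+        stream_writer.write(
+            pa.record_batch([pa.array([1, 2, 3], type=pa.int32())],
+                            schema=stream_schema))
+    stream_reader = pa.ipc.open_stream(sink.getvalue())
+    stream_reader.read_all()
+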
+To dump an array to a file, you can use :func:`~pyarrow.ipc.new_file`, which
+will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance that
+can be used to write batches of data to that file.
+
+For example, to write an array of 100M integers, we could write it in 1000
+chunks of 100000 entries each:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch(
+                    [pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+Record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
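+
+For example, a sketch of a two-column batch (the extra column here is purely
+illustrative and not used in the rest of this section) could look like:
+
+.. ipython:: python
+
+    multi_schema = pa.schema([pa.field('nums', pa.int32()),
+                              pa.field('names', pa.string())])
+    multi_batch = pa.record_batch(
+        [pa.array([1, 2, 3], type=pa.int32()),
+         pa.array(['a', 'b', 'c'])],
+        schema=multi_schema)
+    multi_batch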
+
+Writing in batches is effective because in theory we only need to keep the
+current batch in memory while writing it. But when reading back, we can be even
+more efficient by directly mapping the data from disk and avoiding any new
+memory allocation on read.
+
+Under normal conditions, reading back our file will consume a few hundred
+megabytes of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To read big data from disk more efficiently, we can memory map the file, so
+that the array can directly reference the data on disk and avoid copying it to
+memory. In that case memory consumption is greatly reduced, and it is possible
+to read arrays bigger than the total memory.
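+
+As a sketch of how that might look (assuming the ``bigfile.arrow`` written
+above), :func:`~pyarrow.memory_map` can be used in place of
+:class:`~pyarrow.OSFile`:
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))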
Review comment:
There are benefits because, if you are only reading the data (say, to
compute means or whatever on it) and you have to read more data than fits in
your memory, memory mapping lets the kernel evict pages that are no longer in
use without paying the cost of writing them to swap: they are already available
in the memory-mapped file and can be paged back in directly from it.
On the other hand, if you had read the file normally and were relying on
swap, once the data you have to read doesn't fit into memory the kernel has to
pay the cost of writing it to swap; otherwise it would be unable to page it
out, since there would be no copy (as far as the memory manager is concerned)
from which to page it back in.
So memory mapping avoids the cost of _writing to the swap file_ when you
are exhausting memory.
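As a rough sketch of that read-only pattern (assuming the `bigfile.arrow`
written in the snippet above), one could memory map the file and scan it batch
by batch with the compute module, without ever materializing the full array:

```python
import pyarrow as pa
import pyarrow.compute as pc

# Memory map the Arrow file: record batches read from it reference the
# mapped pages directly, so the kernel can drop and re-read those pages
# from the file instead of writing them to swap.
with pa.memory_map('bigfile.arrow', 'rb') as source:
    reader = pa.ipc.open_file(source)
    total = 0
    rows = 0
    for i in range(reader.num_record_batches):
        batch = reader.get_batch(i)
        total += pc.sum(batch.column('nums')).as_py()
        rows += batch.num_rows
    print("mean:", total / rows)
```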
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]