amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r668616661
##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
!rm example.dat
!rm example2.dat
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero-copy and memory-mapped data, Arrow makes it easy to
+read and write arrays while consuming a minimal amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
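+
+The Streaming Format can be written and read in the same way as the File
+Format used in the rest of this section; unlike the File Format, a stream can
+only be read sequentially. As a minimal sketch (the ``stream_schema`` variable
+and the ``stream_example.arrows`` file name are purely illustrative):
+
+.. ipython:: python
+
+   stream_schema = pa.schema([pa.field('nums', pa.int32())])
+
+   # Write a single small batch using the Streaming Format
+   with pa.OSFile('stream_example.arrows', 'wb') as sink:
+      with pa.ipc.new_stream(sink, stream_schema) as writer:
+         writer.write(pa.record_batch([pa.array([1, 2, 3], type=pa.int32())],
+                                      stream_schema))
+
+   # Read the stream back, batch by batch
+   with pa.OSFile('stream_example.arrows', 'rb') as source:
+      for batch in pa.ipc.open_stream(source):
+         print(batch.num_rows)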
+
+To dump an array to a file, you can use :meth:`~pyarrow.ipc.new_file`,
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example, to write an array of 100M integers, we could write it in 1000
+chunks of 100000 entries each:
+
+.. ipython:: python
+
+   BATCH_SIZE = 100000
+   NUM_BATCHES = 1000
+
+   schema = pa.schema([pa.field('nums', pa.int32())])
+
+   with pa.OSFile('bigfile.arrow', 'wb') as sink:
+      with pa.ipc.new_file(sink, schema) as writer:
+         for row in range(NUM_BATCHES):
+            batch = pa.record_batch([pa.array(range(BATCH_SIZE),
+                                              type=pa.int32())], schema)
+            writer.write(batch)
+
+Record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
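+
+For instance, a two-column batch could be built as follows (a minimal sketch;
+the ``multi_schema`` variable and the ``'names'`` field are purely
+illustrative):
+
+.. ipython:: python
+
+   multi_schema = pa.schema([pa.field('nums', pa.int32()),
+                             pa.field('names', pa.string())])
+   multi_batch = pa.record_batch([pa.array([1, 2, 3], type=pa.int32()),
+                                  pa.array(['a', 'b', 'c'])], multi_schema)
+   multi_batch.num_columns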
+
+Writing in batches is effective because in theory we only need to keep the
+current batch in memory while writing. When reading back, we can be even more
+effective by mapping the data directly from disk, avoiding any new memory
+allocation on read.
+
+Under normal conditions, reading back our file will consume a few hundred
+megabytes of memory:
+
+.. ipython:: python
+
+   with pa.OSFile('bigfile.arrow', 'rb') as source:
+      loaded_array = pa.ipc.open_file(source).read_all()
+
+   print("LEN:", len(loaded_array))
+   print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To read big data from disk more efficiently, we can memory-map the file, so
+that Arrow can directly reference the data mapped from disk and avoid having
+to allocate its own memory.
+In that case the operating system will be able to page in the mapped memory
+lazily and page it out without any write-back cost when under memory
+pressure, making it possible to read arrays bigger than the total memory.
+
+.. ipython:: python
+
+   with pa.memory_map('bigfile.arrow', 'r') as source:
+      loaded_array = pa.ipc.open_file(source).read_all()
+
+   print("LEN:", len(loaded_array))
+   print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally, we can write the loaded array back to disk without consuming any
+extra memory, because iterating over the array just scans through the data
+without making copies of it:
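+
+A minimal sketch of such a write-back, reusing the memory-mapped
+``loaded_array`` and the ``schema`` defined above (the ``bigfile_copy.arrow``
+output path is purely illustrative):
+
+.. ipython:: python
+
+   with pa.OSFile('bigfile_copy.arrow', 'wb') as sink:
+      with pa.ipc.new_file(sink, schema) as writer:
+         # Iterating over the record batches only scans the mapped data.
+         for batch in loaded_array.to_batches():
+            writer.write(batch)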
Review comment:
@jorisvandenbossche the memory mapping only provides benefits if you don't
alter the data. If you applied any transformation to it, the data would no
longer be equal to the data on disk, and you would lose all the benefits of
memory mapping because the kernel would no longer be able to use the
memory-mapped file instead of the swap file when it needs to page out. I
guess we can remove the "write back" section of the documentation if you
think it doesn't provide much value.

My primary goal was mostly to say "if you need to open a big IPC format
file, open it using memory mapping or you will just face an OOM"; the
writing-back section was meant to reinforce that concept but doesn't really
add any additional value. I'm mostly interested in shipping an addition to
the docs that documents that concept somewhere, and I don't want to make
perfect the enemy of good enough, so I'm ok with deferring to other PRs any
part that doesn't get obvious consensus.
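
For illustration only (a rough sketch, not part of the PR; `pc.multiply` and
the `'nums'` column are just examples), this is the effect I mean: as soon as
the data is transformed, Arrow has to materialize new buffers that can only
be paged out to swap, not back to the mapped file.

```python
import pyarrow as pa
import pyarrow.compute as pc

with pa.memory_map('bigfile.arrow', 'r') as source:
    table = pa.ipc.open_file(source).read_all()

# Still backed by the mapped file: Arrow has allocated essentially nothing.
print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

# Any transformation materializes new buffers in anonymous memory.
doubled = pc.multiply(table['nums'], 2)
print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
```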
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]