On Thu, Aug 25, 2022 at 4:27 AM Bill Ross <bross_phobr...@sonic.net> wrote:

> Thanks, np.lib.format.open_memmap() works great! With prediction procs
> using minimal sys memory, I can get twice as many on GPU, with fewer
> optimization warnings.
>
> Why even have the number of records in the header? Shouldn't record size
> plus system-reported/growable file size be enough?
>
Only in the happy case where there is no corruption. Implicitness is not a
virtue in the use cases that the format was designed for. There is an
additional use case, where the length is unknown a priori, in which
implicitness would help, but the format was not designed for that case (and
I'm not sure I want to add it).
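
For what it's worth, the shape is written out explicitly in the header, and
you can inspect it directly. A minimal sketch, assuming a version 1.0 file
and using "predsum.npy" as a placeholder name:

  import numpy as np

  # Read just the magic string and the header; the shape (and thus the
  # record count) is stored explicitly, which is what lets a reader detect
  # a truncated or corrupted file.
  with open("predsum.npy", "rb") as fp:
      major, minor = np.lib.format.read_magic(fp)
      shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(fp)
  print(shape, dtype)  # e.g. (1000000,) float32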

> I'd love to have a shared-mem analog for smaller-scale data; now I load
> data and fork to emulate that effect.
>
There are a number of ways to do that, including using memmap on files on a
memory-backed filesystem like /dev/shm/ on Linux. See this article for
several more options:


https://luis-sena.medium.com/sharing-big-numpy-arrays-across-python-processes-abf0dc2a0ab2
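
One of the options covered there is `multiprocessing.shared_memory` from the
standard library (Python 3.8+), which backs an ndarray with a named segment
that other processes can attach to by name. A minimal sketch; the segment
name and sizes here are placeholders:

  import numpy as np
  from multiprocessing import shared_memory

  # Creator: allocate a named segment and view it as a float32 array.
  shm = shared_memory.SharedMemory(create=True, size=4 * 1_000_000,
                                   name="preds")
  arr = np.ndarray((1_000_000,), dtype=np.float32, buffer=shm.buf)
  arr[:] = 0.0  # initialize in place; no copy is made

  # Another process attaches by name instead of copying:
  #   shm = shared_memory.SharedMemory(name="preds")
  #   arr = np.ndarray((1_000_000,), dtype=np.float32, buffer=shm.buf)

  shm.close()   # each process closes its own handle
  shm.unlink()  # the creator removes the segment when finished

On Linux this is backed by /dev/shm as well, so it's closely related to the
memmap-on-tmpfs approach above.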

> My file sizes will exceed memory, so I'm hoping to get the most out of
> memmap. Will this in-loop assignment to predsum work to avoid loading all
> to memory?
>
>     predsum = np.lib.format.open_memmap(outfile, mode='w+',
>                                         shape=(ids_sq,), dtype=np.float32)
>
>     for i in range(len(IN_FILES)):
>
>         pred = np.lib.format.open_memmap(IN_FILES[i])
>
>         predsum = np.add(predsum, pred)  ################# <-
>
This will rebind `predsum` to a new in-memory array the first time through
the loop, rather than writing into the memory-mapped file. Use `out=predsum`
to make sure that the output goes into the memory-mapped array:

  np.add(predsum, pred, out=predsum)

Or the usual augmented assignment:

  predsum += pred

>         del pred
>     del predsum
>
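
Putting it together, your loop would look something like this (same variable
names as your snippet, with an explicit flush() at the end to push any
remaining dirty pages out to the file):

  import numpy as np

  predsum = np.lib.format.open_memmap(outfile, mode='w+',
                                      shape=(ids_sq,), dtype=np.float32)
  for infile in IN_FILES:
      pred = np.lib.format.open_memmap(infile, mode='r')
      predsum += pred  # writes through the memmap; no full-size temporary
      del pred         # drop this input's mapping
  predsum.flush()      # push dirty pages to the file
  del predsum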

The precise memory behavior will depend on your OS's virtual memory
configuration. But in general, `np.add()` will go through the arrays in
order, causing the virtual memory system to page in memory pages as they
are accessed for reading or writing, and page out old ones to make room
for the new. Linux, in my experience, isn't always the best at managing
that backlog of old pages, especially when multiple processes are doing
similar kinds of things; in the past, I have seen *each* of those processes
try to use *all* of main memory for its backlog of old pages. There are
configuration tweaks you can make, though.
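
On the application side, one way to keep that backlog bounded is to walk the
arrays in fixed-size chunks and flush as you go, so only a bounded number of
dirty pages is outstanding at any time. A sketch, with placeholder filenames
and a chunk size you would want to tune:

  import numpy as np

  CHUNK = 4 * 1024 * 1024  # elements per chunk; tune to your memory budget

  predsum = np.lib.format.open_memmap("predsum.npy", mode="r+")
  pred = np.lib.format.open_memmap("pred.npy", mode="r")
  for start in range(0, predsum.shape[0], CHUNK):
      stop = start + CHUNK  # slicing clamps past the end automatically
      np.add(predsum[start:stop], pred[start:stop], out=predsum[start:stop])
      predsum.flush()  # write dirty pages back promptly

(The kernel-side knobs are sysctls like vm.dirty_ratio and
vm.dirty_background_ratio.)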

-- 
Robert Kern