Re: [Numpy-discussion] Change in memmap behaviour

2012-07-04 Thread Nathaniel Smith
On Tue, Jul 3, 2012 at 4:08 PM, Nathaniel Smith n...@pobox.com wrote:
 On Tue, Jul 3, 2012 at 10:35 AM, Thouis (Ray) Jones tho...@gmail.com wrote:
 On Mon, Jul 2, 2012 at 11:52 PM, Sveinung Gundersen svein...@gmail.com 
 wrote:

 On 2 July 2012, at 22:40, Nathaniel Smith wrote:

 On Mon, Jul 2, 2012 at 6:54 PM, Sveinung Gundersen svein...@gmail.com 
 wrote:
 [snip]



 Your actual memory usage may not have increased as much as you think,
 since memmap objects don't necessarily take much memory -- it sounds
 like you're leaking virtual memory, but your resident set size
 shouldn't go up as much.


 As I understand it, memmap objects retain the contents of the memmap in
 memory after it has been read the first time (in a lazy manner). Thus, when
 reading a slice of a 24GB file, only that part resides in memory. Our system
 reads a slice of a memmap, calculates something (say, the sum), and then
 deletes the memmap. It then loops through this for consecutive slices,
 retaining a low memory usage. Consider the following code:

 import numpy as np
 res = []
 vecLen = 3095677412
 for i in xrange(vecLen/10**8+1):
     x = i * 10**8
     y = min((i+1) * 10**8, vecLen)
     res.append(np.memmap('val.float64', dtype='float64')[x:y].sum())

 The memory usage of this code on a 24GB file (one value for each 
 nucleotide
 in the human DNA!) is 23g resident memory after the loop is finished (not
 24g for some reason..).

 Running the same code on 1.5.1rc1 gives a resident memory of 23m after the
 loop.

 Your memory measurement tools are misleading you. The same memory is
 resident in both cases, just in one case your tools say it is
 operating system disk cache (and not attributed to your app), and in
 the other case that same memory, treated in the same way by the OS, is
 shown as part of your app's resident memory. Virtual memory is
 confusing...

 But the crucial difference is perhaps that the disk cache can be cleared by
 the OS if needed, whereas application memory cannot be reclaimed in the same
 way and must instead be swapped to disk? Or am I still confused?

 (snip)


 Great! Any idea on whether such a patch may be included in 1.7?

 Not really, if I or you or someone else gets inspired to take the time
 to write a patch soon then it will be, otherwise not...

 -N

 I have now tried to add a patch, in the way you proposed, but I may have 
 gotten it wrong..

 http://projects.scipy.org/numpy/ticket/2179

 I put this in a github repo, and added tests (author credit to Sveinung)
 https://github.com/thouis/numpy/tree/mmap_children

 I'm not sure which branch to issue a PR against, though.

 Looks good to me, thanks to both of you!

 Obviously should be merged to master; beyond that I'm not sure. We
 definitely want it in 1.7, but I'm not sure if that's been branched
 yet or not. (Or rather, it has been branched, but then maybe it was
 unbranched again? Travis?) Since it was a 1.6 regression it'd make
 sense to cherrypick to the 1.6 branch too, just in case it gets
 another release.

Merged into master and maintenance/1.6.x, but not maintenance/1.7.x;
I'll let Ondrej or Travis figure that out...

-N
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Change in memmap behaviour

2012-07-03 Thread Thouis (Ray) Jones
On Mon, Jul 2, 2012 at 11:52 PM, Sveinung Gundersen svein...@gmail.com wrote:

 On 2 July 2012, at 22:40, Nathaniel Smith wrote:

 On Mon, Jul 2, 2012 at 6:54 PM, Sveinung Gundersen svein...@gmail.com 
 wrote:
 [snip]



 Your actual memory usage may not have increased as much as you think,
 since memmap objects don't necessarily take much memory -- it sounds
 like you're leaking virtual memory, but your resident set size
 shouldn't go up as much.


 As I understand it, memmap objects retain the contents of the memmap in
 memory after it has been read the first time (in a lazy manner). Thus, when
 reading a slice of a 24GB file, only that part resides in memory. Our system
 reads a slice of a memmap, calculates something (say, the sum), and then
 deletes the memmap. It then loops through this for consecutive slices,
 retaining a low memory usage. Consider the following code:

 import numpy as np
 res = []
 vecLen = 3095677412
 for i in xrange(vecLen/10**8+1):
     x = i * 10**8
     y = min((i+1) * 10**8, vecLen)
     res.append(np.memmap('val.float64', dtype='float64')[x:y].sum())

 The memory usage of this code on a 24GB file (one value for each nucleotide
 in the human DNA!) is 23g resident memory after the loop is finished (not
 24g for some reason..).

 Running the same code on 1.5.1rc1 gives a resident memory of 23m after the
 loop.

 Your memory measurement tools are misleading you. The same memory is
 resident in both cases, just in one case your tools say it is
 operating system disk cache (and not attributed to your app), and in
 the other case that same memory, treated in the same way by the OS, is
 shown as part of your app's resident memory. Virtual memory is
 confusing...

 But the crucial difference is perhaps that the disk cache can be cleared by
 the OS if needed, whereas application memory cannot be reclaimed in the same
 way and must instead be swapped to disk? Or am I still confused?

 (snip)


 Great! Any idea on whether such a patch may be included in 1.7?

 Not really, if I or you or someone else gets inspired to take the time
 to write a patch soon then it will be, otherwise not...

 -N

 I have now tried to add a patch, in the way you proposed, but I may have 
 gotten it wrong..

 http://projects.scipy.org/numpy/ticket/2179

I put this in a github repo, and added tests (author credit to Sveinung)
https://github.com/thouis/numpy/tree/mmap_children
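
For context, a regression test for this could look something like the sketch
below (hypothetical -- not necessarily what is in the branch above):

import os
import tempfile

import numpy as np
from numpy.testing import assert_

def test_non_view_results_do_not_hold_mmap():
    fd, fname = tempfile.mkstemp()
    os.close(fd)
    try:
        fp = np.memmap(fname, dtype=np.float64, mode='w+', shape=(10,))
        fp[:] = 1.0
        # Slices are views and may legitimately keep the mmap alive...
        assert_(fp[2:5]._mmap is fp._mmap)
        # ...but results that share no memory with the file should not.
        for result in (fp + 1, fp[[1, 2, 3]], fp.sum()):
            assert_(getattr(result, '_mmap', None) is None)
        del fp
    finally:
        os.remove(fname)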

I'm not sure which branch to issue a PR against, though.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Change in memmap behaviour

2012-07-03 Thread Nathaniel Smith
On Tue, Jul 3, 2012 at 10:35 AM, Thouis (Ray) Jones tho...@gmail.com wrote:
 On Mon, Jul 2, 2012 at 11:52 PM, Sveinung Gundersen svein...@gmail.com 
 wrote:

 On 2 July 2012, at 22:40, Nathaniel Smith wrote:

 On Mon, Jul 2, 2012 at 6:54 PM, Sveinung Gundersen svein...@gmail.com 
 wrote:
 [snip]



 Your actual memory usage may not have increased as much as you think,
 since memmap objects don't necessarily take much memory -- it sounds
 like you're leaking virtual memory, but your resident set size
 shouldn't go up as much.


 As I understand it, memmap objects retain the contents of the memmap in
 memory after it has been read the first time (in a lazy manner). Thus, when
 reading a slice of a 24GB file, only that part resides in memory. Our system
 reads a slice of a memmap, calculates something (say, the sum), and then
 deletes the memmap. It then loops through this for consecutive slices,
 retaining a low memory usage. Consider the following code:

 import numpy as np
 res = []
 vecLen = 3095677412
 for i in xrange(vecLen/10**8+1):
     x = i * 10**8
     y = min((i+1) * 10**8, vecLen)
     res.append(np.memmap('val.float64', dtype='float64')[x:y].sum())

 The memory usage of this code on a 24GB file (one value for each nucleotide
 in the human DNA!) is 23g resident memory after the loop is finished (not
 24g for some reason..).

 Running the same code on 1.5.1rc1 gives a resident memory of 23m after the
 loop.

 Your memory measurement tools are misleading you. The same memory is
 resident in both cases, just in one case your tools say it is
 operating system disk cache (and not attributed to your app), and in
 the other case that same memory, treated in the same way by the OS, is
 shown as part of your app's resident memory. Virtual memory is
 confusing...

 But the crucial difference is perhaps that the disk cache can be cleared by
 the OS if needed, whereas application memory cannot be reclaimed in the same
 way and must instead be swapped to disk? Or am I still confused?

 (snip)


 Great! Any idea on whether such a patch may be included in 1.7?

 Not really, if I or you or someone else gets inspired to take the time
 to write a patch soon then it will be, otherwise not...

 -N

 I have now tried to add a patch, in the way you proposed, but I may have 
 gotten it wrong..

 http://projects.scipy.org/numpy/ticket/2179

 I put this in a github repo, and added tests (author credit to Sveinung)
 https://github.com/thouis/numpy/tree/mmap_children

 I'm not sure which branch to issue a PR against, though.

Looks good to me, thanks to both of you!

Obviously should be merged to master; beyond that I'm not sure. We
definitely want it in 1.7, but I'm not sure if that's been branched
yet or not. (Or rather, it has been branched, but then maybe it was
unbranched again? Travis?) Since it was a 1.6 regression it'd make
sense to cherrypick to the 1.6 branch too, just in case it gets
another release.

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Change in memmap behaviour

2012-07-02 Thread Nathaniel Smith
On Mon, Jul 2, 2012 at 3:53 PM, Sveinung Gundersen svein...@gmail.com wrote:
 Hi,

 We are developing a large project for genome analysis
 (http://hyperbrowser.uio.no), where we use memmap vectors as the basic data
 structure for storage. The stored data are accessed in slices, and used as the
 basis for calculations. As the stored data may be large (up to 24 GB), the
 memory footprint is important.

 We experienced a problem with 64-bit addressing for the function concatenate
 (using the quite old numpy version 1.5.1rc1), and have thus updated the version
 of numpy to 1.7.0.dev-651ef74, where the problem has been fixed. We have,
 however, experienced another problem connected to a change in memmap
 behaviour. This change seems to have come with the 1.6 release.

 Before (1.5.1rc1):

 >>> import platform; print platform.python_version()
 2.7.0
 >>> import numpy as np
 >>> np.version.version
 '1.5.1rc1'
 >>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
 >>> a[:] = 2
 >>> a[0:2]
 memmap([2, 2], dtype=int32)
 >>> a[0:2]._mmap
 <mmap.mmap object at 0x3c246f8>
 >>> a.sum()
 40
 >>> a.sum()._mmap
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 AttributeError: 'numpy.int64' object has no attribute '_mmap'

 After (1.6.2):

 >>> import platform; print platform.python_version()
 2.7.0
 >>> import numpy as np
 >>> np.version.version
 '1.6.2'
 >>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
 >>> a[:] = 2
 >>> a[0:2]
 memmap([2, 2], dtype=int32)
 >>> a[0:2]._mmap
 <mmap.mmap object at 0x1b82ed50>
 >>> a.sum()
 memmap(40)
 >>> a.sum()._mmap
 <mmap.mmap object at 0x1b82ed50>

 The problem, then, is that calculations on memmap objects which produce
 scalar results previously returned a numpy scalar, with no reference to the
 memmap object. We could then just keep the result, and mark the memmap for
 garbage collection. Now, the memory usage of the system has increased
 dramatically, as we no longer have this option.

Your actual memory usage may not have increased as much as you think,
since memmap objects don't necessarily take much memory -- it sounds
like you're leaking virtual memory, but your resident set size
shouldn't go up as much.

That said, this is clearly a bug, and it's even worse than you mention
-- *all* operations on memmap arrays are holding onto references to
the original mmap object, regardless of whether they share any memory:
>>> a = np.memmap("/etc/passwd", np.uint8, "r")
# arithmetic
>>> (a + 10)._mmap is a._mmap
True
# fancy indexing (doesn't return a view!)
>>> a[[1, 2, 3]]._mmap is a._mmap
True
>>> a.sum()._mmap is a._mmap
True
Really, only slicing should be returning a np.memmap object at all.
Unfortunately, it is currently impossible to create an ndarray
subclass that returns base-class ndarrays from any operations --
__array_finalize__() has no way to do this. And this is the third
ndarray subclass in a row that I've looked at that wanted to be able
to do this, so I guess maybe it's something we should implement...

In the short term, the numpy-upstream fix is to change
numpy.core.memmap:memmap.__array_finalize__ so that it only copies
over the ._mmap attribute of its parent if np.may_share_memory(self,
parent) is True. Patches gratefully accepted ;-)
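
Something along these lines, as a rough and untested sketch of the idea (the
real method may also need to propagate other attributes, e.g. .filename, in
the same way, which an actual patch would have to handle too):

import numpy as np

# Illustrative sketch of the proposed body for memmap.__array_finalize__
# in numpy/core/memmap.py -- not the actual patch:
def __array_finalize__(self, obj):
    if hasattr(obj, '_mmap') and np.may_share_memory(self, obj):
        # Genuine views into the file keep the mapping alive.
        self._mmap = obj._mmap
    else:
        # Sums, arithmetic results, and fancy-indexing copies do not share
        # memory with the file, so they should not pin the mmap object.
        self._mmap = None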

In the meantime, you have a few options for hacky workarounds. You
could monkeypatch the above fix into the memmap class. You could
manually assign None to the _mmap attribute of offending arrays (being
careful only to do this to arrays where you know it is safe!). And for
reduction operations like sum() in particular, what you have right now
is not actually a scalar object -- it is a 0-dimensional array that
holds a single scalar. You can pull this scalar out by calling .item()
on the array, and then throw away the array itself -- the scalar won't
have any _mmap attribute.
def scalarify(scalar_or_0d_array):
    if isinstance(scalar_or_0d_array, np.ndarray):
        return scalar_or_0d_array.item()
    else:
        return scalar_or_0d_array

# works on both numpy 1.5 and numpy 1.6:
total = scalarify(a.sum())
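
As for the monkeypatch route mentioned above, a rough and untested sketch
(np.memmap is a pure-Python ndarray subclass, so its method can be replaced
at runtime, ideally once at startup before any memmaps are created):

import numpy as np

_orig_finalize = np.memmap.__array_finalize__

def _finalize_without_leak(self, obj):
    # Run the stock __array_finalize__, then drop the mmap reference on
    # results that don't actually share memory with their parent.
    _orig_finalize(self, obj)
    if (getattr(self, '_mmap', None) is not None and obj is not None
            and not np.may_share_memory(self, obj)):
        self._mmap = None

np.memmap.__array_finalize__ = _finalize_without_leak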

-N
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Change in memmap behaviour

2012-07-02 Thread Sveinung Gundersen
[snip]

 
 Your actual memory usage may not have increased as much as you think,
 since memmap objects don't necessarily take much memory -- it sounds
 like you're leaking virtual memory, but your resident set size
 shouldn't go up as much.

As I understand it, memmap objects retain the contents of the memmap in memory 
after it has been read the first time (in a lazy manner). Thus, when reading a 
slice of a 24GB file, only that part resides in memory. Our system reads a
slice of a memmap, calculates something (say, the sum), and then deletes the
memmap. It then loops through this for consecutive slices, retaining a low
memory usage. Consider the following code:

import numpy as np
res = []
vecLen = 3095677412
for i in xrange(vecLen/10**8+1): 
    x = i * 10**8
    y = min((i+1) * 10**8, vecLen)
    res.append(np.memmap('val.float64', dtype='float64')[x:y].sum())

The memory usage of this code on a 24GB file (one value for each nucleotide in 
the human DNA!) is 23g resident memory after the loop is finished (not 24g for 
some reason..).

Running the same code on 1.5.1rc1 gives a resident memory of 23m after the loop.

 
 That said, this is clearly a bug, and it's even worse than you mention
 -- *all* operations on memmap arrays are holding onto references to
 the original mmap object, regardless of whether they share any memory:
 >>> a = np.memmap("/etc/passwd", np.uint8, "r")
 # arithmetic
 >>> (a + 10)._mmap is a._mmap
 True
 # fancy indexing (doesn't return a view!)
 >>> a[[1, 2, 3]]._mmap is a._mmap
 True
 >>> a.sum()._mmap is a._mmap
 True
 Really, only slicing should be returning a np.memmap object at all.
 Unfortunately, it is currently impossible to create an ndarray
 subclass that returns base-class ndarrays from any operations --
 __array_finalize__() has no way to do this. And this is the third
 ndarray subclass in a row that I've looked at that wanted to be able
 to do this, so I guess maybe it's something we should implement...
 
 In the short term, the numpy-upstream fix is to change
 numpy.core.memmap:memmap.__array_finalize__ so that it only copies
 over the ._mmap attribute of its parent if np.may_share_memory(self,
 parent) is True. Patches gratefully accepted ;-)

Great! Any idea on whether such a patch may be included in 1.7?

 
 In the meantime, you have a few options for hacky workarounds. You
 could monkeypatch the above fix into the memmap class. You could
 manually assign None to the _mmap attribute of offending arrays (being
 careful only to do this to arrays where you know it is safe!). And for
 reduction operations like sum() in particular, what you have right now
 is not actually a scalar object -- it is a 0-dimensional array that
 holds a single scalar. You can pull this scalar out by calling .item()
 on the array, and then throw away the array itself -- the scalar won't
 have any _mmap attribute.
 def scalarify(scalar_or_0d_array):
     if isinstance(scalar_or_0d_array, np.ndarray):
         return scalar_or_0d_array.item()
     else:
         return scalar_or_0d_array

 # works on both numpy 1.5 and numpy 1.6:
 total = scalarify(a.sum())

Thank you for this! However, such a solution would have to be scattered 
throughout the code (probably over 100 places), and I would rather not do that. 
I guess the abovementioned patch would be the best solution. I do not have 
experience in the numpy core code, so I am also eagerly awaiting such a patch!

Sveinung

--
Sveinung Gundersen
PhD Student, Bioinformatics, Dept. of Tumor Biology, Inst. for Cancer Research, 
The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
E-mail: sveinung.gunder...@medisin.uio.no, Phone: +47 93 00 94 54


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Change in memmap behaviour

2012-07-02 Thread Nathaniel Smith
On Mon, Jul 2, 2012 at 6:54 PM, Sveinung Gundersen svein...@gmail.com wrote:
 [snip]



 Your actual memory usage may not have increased as much as you think,
 since memmap objects don't necessarily take much memory -- it sounds
 like you're leaking virtual memory, but your resident set size
 shouldn't go up as much.


 As I understand it, memmap objects retain the contents of the memmap in
 memory after it has been read the first time (in a lazy manner). Thus, when
 reading a slice of a 24GB file, only that part resides in memory. Our system
 reads a slice of a memmap, calculates something (say, the sum), and then
 deletes the memmap. It then loops through this for consecutive slices,
 retaining a low memory usage. Consider the following code:

 import numpy as np
 res = []
 vecLen = 3095677412
 for i in xrange(vecLen/10**8+1):
     x = i * 10**8
     y = min((i+1) * 10**8, vecLen)
     res.append(np.memmap('val.float64', dtype='float64')[x:y].sum())

 The memory usage of this code on a 24GB file (one value for each nucleotide
 in the human DNA!) is 23g resident memory after the loop is finished (not
 24g for some reason..).

 Running the same code on 1.5.1rc1 gives a resident memory of 23m after the
 loop.

Your memory measurement tools are misleading you. The same memory is
resident in both cases, just in one case your tools say it is
operating system disk cache (and not attributed to your app), and in
the other case that same memory, treated in the same way by the OS, is
shown as part of your app's resident memory. Virtual memory is
confusing...
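
For example, on Linux one rough way to see this is to compare what is
attributed to your process with what sits in the kernel's page cache
(an illustrative sketch, nothing specific to your code):

def read_kb(path, key):
    # Both /proc files report sizes in kB, e.g. "VmRSS:   123456 kB".
    with open(path) as f:
        for line in f:
            if line.startswith(key):
                return int(line.split()[1])

print('process RSS:   %d kB' % read_kb('/proc/self/status', 'VmRSS:'))
print('OS page cache: %d kB' % read_kb('/proc/meminfo', 'Cached:'))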

 That said, this is clearly a bug, and it's even worse than you mention
 -- *all* operations on memmap arrays are holding onto references to
 the original mmap object, regardless of whether they share any memory:

 >>> a = np.memmap("/etc/passwd", np.uint8, "r")
 # arithmetic
 >>> (a + 10)._mmap is a._mmap
 True
 # fancy indexing (doesn't return a view!)
 >>> a[[1, 2, 3]]._mmap is a._mmap
 True
 >>> a.sum()._mmap is a._mmap
 True
 Really, only slicing should be returning a np.memmap object at all.
 Unfortunately, it is currently impossible to create an ndarray
 subclass that returns base-class ndarrays from any operations --
 __array_finalize__() has no way to do this. And this is the third
 ndarray subclass in a row that I've looked at that wanted to be able
 to do this, so I guess maybe it's something we should implement...

 In the short term, the numpy-upstream fix is to change
 numpy.core.memmap:memmap.__array_finalize__ so that it only copies
 over the ._mmap attribute of its parent if np.may_share_memory(self,
 parent) is True. Patches gratefully accepted ;-)


 Great! Any idea on whether such a patch may be included in 1.7?

Not really, if I or you or someone else gets inspired to take the time
to write a patch soon then it will be, otherwise not...

-N
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Change in memmap behaviour

2012-07-02 Thread Sveinung Gundersen

On 2 July 2012, at 22:40, Nathaniel Smith wrote:

 On Mon, Jul 2, 2012 at 6:54 PM, Sveinung Gundersen svein...@gmail.com wrote:
 [snip]
 
 
 
 Your actual memory usage may not have increased as much as you think,
 since memmap objects don't necessarily take much memory -- it sounds
 like you're leaking virtual memory, but your resident set size
 shouldn't go up as much.
 
 
 As I understand it, memmap objects retain the contents of the memmap in
 memory after it has been read the first time (in a lazy manner). Thus, when
 reading a slice of a 24GB file, only that part resides in memory. Our system
 reads a slice of a memmap, calculates something (say, the sum), and then
 deletes the memmap. It then loops through this for consecutive slices,
 retaining a low memory usage. Consider the following code:
 
 import numpy as np
 res = []
 vecLen = 3095677412
 for i in xrange(vecLen/10**8+1):
     x = i * 10**8
     y = min((i+1) * 10**8, vecLen)
     res.append(np.memmap('val.float64', dtype='float64')[x:y].sum())
 
 The memory usage of this code on a 24GB file (one value for each nucleotide
 in the human DNA!) is 23g resident memory after the loop is finished (not
 24g for some reason..).
 
 Running the same code on 1.5.1rc1 gives a resident memory of 23m after the
 loop.
 
 Your memory measurement tools are misleading you. The same memory is
 resident in both cases, just in one case your tools say it is
 operating system disk cache (and not attributed to your app), and in
 the other case that same memory, treated in the same way by the OS, is
 shown as part of your app's resident memory. Virtual memory is
 confusing...

But the crucial difference is perhaps that the disk cache can be cleared by the
OS if needed, whereas application memory cannot be reclaimed in the same way
and must instead be swapped to disk? Or am I still confused?

(snip)

 
 Great! Any idea on whether such a patch may be included in 1.7?
 
 Not really, if I or you or someone else gets inspired to take the time
 to write a patch soon then it will be, otherwise not...
 
 -N

I have now tried to add a patch, in the way you proposed, but I may have gotten 
it wrong..

http://projects.scipy.org/numpy/ticket/2179

Sveinung
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion