Hi,

We are developing a large project for genome analysis 
(http://hyperbrowser.uio.no), where we use memmap vectors as the basic data 
structure for storage. The stored data are accessed in slices, and used as 
basis for calculations. As the stored data may be large (up to 24 GB), the 
memory footprint is important. 

We experienced a problem with 64-bit addressing in the concatenate function 
(using the rather old numpy version 1.5.1rc1), and have thus updated numpy to 
1.7.0.dev-651ef74, where that problem has been fixed. We have, however, run 
into another problem, connected to a change in memmap behaviour that seems to 
have come with the 1.6 release.

Before (1.5.1rc1):

>>> import platform; print platform.python_version()
2.7.0
>>> import numpy as np
>>> np.version.version
'1.5.1rc1'
>>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
>>> a[:] = 2
>>> a[0:2]
memmap([2, 2], dtype=int32)
>>> a[0:2]._mmap
<mmap.mmap object at 0x3c246f8>
>>> a.sum()
40
>>> a.sum()._mmap
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.int64' object has no attribute '_mmap'

After (1.6.2):
>>> import platform; print platform.python_version()
2.7.0
>>> import numpy as np
>>> np.version.version
'1.6.2'
>>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
>>> a[:] = 2
>>> a[0:2]
memmap([2, 2], dtype=int32)
>>> a[0:2]._mmap
<mmap.mmap object at 0x1b82ed50>
>>> a.sum()
memmap(40)
>>> a.sum()._mmap
<mmap.mmap object at 0x1b82ed50>

The problem is that calculations on memmap objects that produce scalar 
results previously returned a numpy scalar, with no reference to the memmap 
object. We could then keep just the result and leave the memmap itself to the 
garbage collector. The memory usage of the system has now increased 
dramatically, as we no longer have this option.
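To illustrate, the only per-call-site workaround we see is to detach the value 
explicitly before dropping the memmap (a minimal sketch; 'testmemmap' is just 
a scratch file, and .item() is used here to force a plain Python scalar):

```python
import numpy as np

# Write a small memmap-backed vector (scratch file for illustration)
a = np.memmap('testmemmap', dtype='int32', mode='w+', shape=20)
a[:] = 2

# .item() converts the 0-d result to a plain Python int,
# which holds no reference to the underlying mmap
total = a.sum().item()

del a  # the mmap buffer can now be garbage collected
```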

So, the question is twofold:

1) What is the reason behind this change? It makes sense to keep the reference 
to the mmap when slicing, but keeping it on a scalar result does not seem very 
useful. Is there any possibility of returning to the old behaviour?
2) If not, do you have any advice on how we can restore the old behaviour 
without rewriting the system? We could cast the results of all functions 
called on the memmap, but these calls are scattered throughout the system, and 
changing them all would probably cause much headache. We would rather 
implement a general solution, for instance wrapping the memmap object somehow. 
Do you have any ideas?
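One general approach we have considered (a hypothetical sketch, not tested 
against our full system) is a thin memmap subclass whose __array_wrap__ 
unwraps 0-d results back to plain scalars, so that only array-shaped results 
keep the mmap reference:

```python
import numpy as np

class ScalarSafeMemmap(np.memmap):
    """Hypothetical subclass: 0-d results of ufuncs and reductions are
    unwrapped to plain numpy scalars, so they keep no mmap reference."""

    def __array_wrap__(self, arr, *args, **kwargs):
        arr = super(ScalarSafeMemmap, self).__array_wrap__(arr, *args, **kwargs)
        # A 0-d result is a scalar in disguise; extract the plain value
        if isinstance(arr, np.ndarray) and arr.ndim == 0:
            return arr[()]
        return arr

# Usage: same constructor arguments as np.memmap ('testmemmap2' is scratch)
b = ScalarSafeMemmap('testmemmap2', dtype='int32', mode='w+', shape=20)
b[:] = 2
s = b.sum()   # now a plain numpy scalar, with no ._mmap attribute
```

All objects created from this class would then behave as before for slicing, 
while scalar-producing operations would no longer pin the mmap in memory.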

Connected to this is the rather puzzling fact that the 'new' memmap scalar 
object has an __iter__ method, but no length. Should the __iter__ method not 
be removed, given that its presence signals that the object is iterable?

Before (1.5.1rc1):
>>> a[0:2].__iter__()
<iterator object at 0x3c22b10>
>>> len(a[0:2])
2
>>> a.sum().__iter__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.int64' object has no attribute '__iter__'
>>> len(a.sum())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'numpy.int64' has no len()

After (1.6.2):
>>> a[0:2].__iter__()
<iterator object at 0x1b7befd0>
>>> len(a[0:2])          
2
>>> a.sum().__iter__
<method-wrapper '__iter__' of memmap object at 0x1b7cab18>
>>> len(a.sum())        
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: len() of unsized object
>>> [x for x in a.sum()]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: iteration over a 0-d array
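Assuming we cannot rely on the presence of __iter__ in 1.6, our fallback on 
our side is to test dimensionality instead (a minimal sketch; the helper name 
is our own):

```python
import numpy as np

def is_iterable_result(x):
    # Presence of __iter__ is misleading for 0-d memmap results in 1.6;
    # np.ndim() distinguishes true arrays from scalar-like results.
    return np.ndim(x) > 0
```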

Regards,
Sveinung Gundersen


_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
