Hi all,

The new tracemalloc infrastructure in Python 3.4 is super-interesting to numerical folks, because we really like memory profiling. Numerical programs allocate a lot of memory, and it's not always clear which operations allocate it (some numpy operations return views of the original array without allocating anything; others return copies). So people actually use memory tracking tools [1], even though traditionally these have been pretty hacky (e.g., just checking RSS before and after each line is executed), and numpy has even grown its own little tracemalloc-like infrastructure [2] -- but it only works for numpy data.
BUT, we also really like calloc(). One of the basic array creation routines in numpy is numpy.zeros(), which returns an array full of -- you guessed it -- zeros. For pretty much all the data types numpy supports, the value zero is represented by the bytestring consisting of all zero bytes, so numpy.zeros() usually uses calloc() to allocate its memory.

calloc() is more awesome than malloc()+memset() for two reasons.

First, for larger allocations calloc() is usually implemented using clever VM tricks: it doesn't actually allocate any memory up front, it just creates a copy-on-write mapping of the system zero page, and then does the actual allocation one page at a time as entries are written to. This means that in the fairly common case where you allocate a large array full of zeros and then set only a few scattered entries to non-zero values, you can end up using much, much less memory than you otherwise would -- it's entirely possible for this to make the difference between being able to run an analysis and not. memset(), by contrast, forces the whole allocation to be committed immediately.

Second, even if you *are* going to touch all the memory, calloc() is still faster than malloc()+memset(). The reason is that for large allocations, malloc() usually does a calloc() no matter what: when you get a new page from the kernel, the kernel has to make sure you can't see random bits of other processes' memory, so it unconditionally zeroes the page before you get to see it. calloc() knows this, so it doesn't bother zeroing the page again. malloc()+memset() zeroes each page twice, producing twice as much memory traffic, which is huge.

SO, we'd like to route our allocations through PyMem_* in order to let tracemalloc "see" them, but because there is no PyMem_*Calloc, doing so would force us to give up the calloc() optimizations. The obvious solution is to add a PyMem_*Calloc to the API. Would this be possible?
Unfortunately, it would require adding a new field to the PyMemAllocator struct, which would be an ABI/API break; PyMemAllocator is exposed directly in the C API and passed by value: https://docs.python.org/3.5/c-api/memory.html#c.PyMemAllocator (Too bad we didn't notice this a few months ago, before 3.4 was released :-(.)

I guess we could just rename the struct in 3.5, to force people to update their code. (I guess there aren't too many people who would have to update their code.)

Thoughts?

-n

[1] http://scikit-learn.org/stable/developers/performance.html#memory-usage-profiling
[2] https://github.com/numpy/numpy/pull/309

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev