You are describing a special case where you know the data size a priori (e.g. not streaming), the dtypes are readily apparent from a small sample, and in general your data is not messy.
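To make that special case concrete, here is a minimal sketch, assuming the row count and column count are known up front, every field parses as a float, and the delimiter is a comma. The function name `read_preallocated` and its parameters are hypothetical and used only for illustration; this is not how numpy's own readers work.

```python
import io

import numpy as np


def read_preallocated(lines, n_rows, n_cols, dtype=np.float64, sep=","):
    """Fill a preallocated array one row at a time.

    Assumes n_rows and n_cols are known in advance and the data is clean
    (no blank lines, no missing fields, every field parses as a float).
    """
    out = np.empty((n_rows, n_cols), dtype=dtype)  # single allocation, final size
    for i, line in enumerate(lines):
        # Only one parsed row exists as Python objects at any time.
        out[i] = [float(field) for field in line.strip().split(sep)]
    return out


# Usage: any iterable of text lines works, including an open file.
data = io.StringIO("1,2,3\n4,5,6\n")
arr = read_preallocated(data, n_rows=2, n_cols=3)
```

If the file has more rows than `n_rows`, the assignment raises an `IndexError`, which is exactly the kind of messy-data case this shortcut does not handle.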
I would agree that if these conditions can be satisfied, then you can achieve closer to a 1x memory overhead.

Using bcolz is great, but it is probably not a realistic option as a dependency for numpy (you should probably just memory map the data directly instead), though this has a big performance impact, so you need to weigh these things.

Not all cases deserve the same treatment. Chunking is often the best option IMHO: it provides constant memory usage (though ultimately still 2x), and combined with memory mapping it can provide fixed resource utilization.

> On Oct 26, 2014, at 9:41 AM, Daπid <davidmen...@gmail.com> wrote:
>
>
>> On 26 October 2014 12:54, Jeff Reback <jeffreb...@gmail.com> wrote:
>> you should have a read here/
>> http://wesmckinney.com/blog/?p=543
>>
>> going below the 2x memory usage on read in is non trivial and costly in
>> terms of performance
>
>
> If you know in advance the number of rows (because it is in the header,
> counted with wc -l, or any other prior information) you can preallocate the
> array and fill in the numbers as you read, with virtually no overhead.
>
> If the number of rows is unknown, an alternative is to use a chunked data
> container like Bcolz [1] (former carray) instead of Python structures. It may
> be used as such, or copied back to a ndarray if we want the memory to be
> aligned. Including a bit of compression we can get the memory overhead to
> somewhere under 2x (depending on the dataset), at the cost of not so much CPU
> time, and this could be very useful for large data and slow filesystems.
>
>
> /David.
>
> [1] http://bcolz.blosc.org/
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
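As a rough illustration of the chunking plus memory mapping idea from the reply above, here is a minimal sketch. It assumes clean, comma-separated numeric text with a known column count but an unknown row count. The function name `text_to_memmap`, the `chunk_rows` parameter, and the raw binary output file are all hypothetical choices for this example, not an existing numpy or pandas API.

```python
import itertools

import numpy as np


def text_to_memmap(lines, path, n_cols, chunk_rows=10000,
                   dtype=np.float64, sep=","):
    """Parse text in fixed-size chunks, append each chunk to a raw binary
    file, then return a read-only memory map over that file.

    Peak parsing memory is bounded by chunk_rows, not by the total row count.
    Assumes clean data and at least one row (np.memmap rejects empty files).
    """
    dtype = np.dtype(dtype)
    lines = iter(lines)
    with open(path, "wb") as f:
        while True:
            # Pull at most chunk_rows lines; an empty list means end of input.
            block = list(itertools.islice(lines, chunk_rows))
            if not block:
                break
            chunk = np.array(
                [[float(x) for x in ln.strip().split(sep)] for ln in block],
                dtype=dtype,
            )
            if chunk.ndim != 2 or chunk.shape[1] != n_cols:
                raise ValueError("unexpected number of columns in chunk")
            chunk.tofile(f)  # append raw bytes in C order
    # Map the whole file as 1-D, then view it as rows of n_cols values.
    flat = np.memmap(path, dtype=dtype, mode="r")
    return flat.reshape(-1, n_cols)
```

The row count never has to be known in advance because the file simply grows as chunks are appended; the trade-off is the extra pass through disk, which is the performance cost mentioned above.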