you are describing a special case where you know the data size a priori (e.g. not 
streaming), dtypes are readily apparent from a small sample 
and in general your data is not messy 

I would agree that if these can be satisfied, you can get closer to a 1x 
memory overhead
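
a minimal sketch of that special case, assuming the row count is known up 
front, every field is a clean float and the delimiter is a comma 
(read_known_rows and its parameters are made-up names for illustration, not 
anything in numpy):

```python
import io
import numpy as np

def read_known_rows(lines, n_rows, n_cols, dtype=np.float64):
    """Fill a preallocated array one row at a time.

    Assumes n_rows is known in advance and every line holds exactly
    n_cols clean numeric fields; messy data would break this.
    """
    out = np.empty((n_rows, n_cols), dtype=dtype)
    for i, line in enumerate(lines):
        # Only one parsed row lives alongside the output array at a time.
        out[i] = [float(x) for x in line.split(",")]
    return out

arr = read_known_rows(io.StringIO("1,2\n3,4\n5,6\n"), n_rows=3, n_cols=2)
print(arr.shape)  # (3, 2)
```

the only extra memory beyond the output array is a single parsed row, which 
is where the "closer to 1x" comes from; an input with more than n_rows lines 
would raise an IndexError here.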

using bcolz is great but probably not a realistic option as a dependency for numpy 
(you should probably just memory map the data directly instead); though this has a big 
perf impact, so you need to weigh these things

not all cases deserve the same treatment - chunking is often the best option 
IMHO - it provides constant memory usage (though ultimately still 2x), and 
combined with memory mapping it can provide fixed resource utilization 
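
a rough sketch of chunking combined with memory mapping, assuming clean 
comma-separated floats and a known column count but an unknown row count 
(text_to_memmap, out_path and chunk_rows are hypothetical names for 
illustration only):

```python
import io
import itertools
import os
import tempfile
import numpy as np

def text_to_memmap(lines, out_path, n_cols, chunk_rows=1000):
    """Parse text in fixed-size chunks, append each to a raw binary file,
    then memory map the result. Peak memory is bounded by one chunk."""
    lines = iter(lines)
    total = 0
    with open(out_path, "wb") as f:
        while True:
            chunk = list(itertools.islice(lines, chunk_rows))
            if not chunk:
                break
            # The chunk's strings and its parsed block coexist briefly,
            # so the overhead is still about 2x, but only per chunk.
            block = np.array(
                [[float(x) for x in row.split(",")] for row in chunk],
                dtype=np.float64,
            )
            block.tofile(f)  # raw C-order bytes appended to the file
            total += len(block)
    # Note: np.memmap cannot map an empty file, so total must be > 0.
    return np.memmap(out_path, dtype=np.float64, mode="r",
                     shape=(total, n_cols))

path = os.path.join(tempfile.mkdtemp(), "data.bin")
arr = text_to_memmap(io.StringIO("1,2\n3,4\n5,6\n"), path,
                     n_cols=2, chunk_rows=2)
print(arr.shape)  # (3, 2)
```

the trade-off is the disk round trip, which is the perf cost to weigh 
against the fixed memory footprint.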

> On Oct 26, 2014, at 9:41 AM, Daπid <davidmen...@gmail.com> wrote:
> 
> 
>> On 26 October 2014 12:54, Jeff Reback <jeffreb...@gmail.com> wrote:
>> you should have a read here:
>> http://wesmckinney.com/blog/?p=543
>> 
>> going below the 2x memory usage on read in is non trivial and costly in 
>> terms of performance 
> 
> 
> If you know in advance the number of rows (because it is in the header, 
> counted with wc -l, or any other prior information) you can preallocate the 
> array and fill in the numbers as you read, with virtually no overhead.
> 
> If the number of rows is unknown, an alternative is to use a chunked data 
> container like Bcolz [1] (former carray) instead of Python structures. It may 
> be used as such, or copied back to a ndarray if we want the memory to be 
> aligned. Including a bit of compression we can get the memory overhead to 
> somewhere under 2x (depending on the dataset), at the cost of not so much CPU 
> time, and this could be very useful for large data and slow filesystems. 
> 
> 
> /David.
> 
> [1] http://bcolz.blosc.org/
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion