2012/9/11 Olivier Grisel <[email protected]>:
>
> Anyway this simulation is probably not representative of your scenario
> as integers can be unboxed in the array datastructure hence the copy
> while string objects cannot and the array will only store references
> to the original string objects.
Actually I was wrong: np.asarray(list_of_strings) allocates a contiguous numpy array of n_samples * max(len(s) for s in list_of_strings) characters.

On the plus side: this would make it possible to use shared memory if the array conversion is done before the parallel call, and hence to benefit from https://github.com/joblib/joblib/pull/44

On the minus side: if the longest string in the list is much longer than the median (which happens frequently in practice), this representation wastes a lot of memory, hence the observed memory explosion.

We should probably set dtype=np.object_ to avoid unboxing the strings in the cross-validation code, and maybe in some parts of the vectorizer code as well. I will open an issue to track this.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
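
For illustration, here is a minimal sketch of the memory difference described above. The toy data (many short strings plus one long outlier) and the variable names fixed and boxed are assumptions made up for this example, not taken from the scikit-learn code. It also assumes Python 3, where the strings become a fixed-width unicode array with 4 bytes per character; with Python 2 byte strings the per-character cost would be 1 byte, but the padding effect is the same.

```python
import numpy as np

# Hypothetical toy data: 999 short strings and one very long outlier.
list_of_strings = ["short"] * 999 + ["x" * 10000]

n_samples = len(list_of_strings)
max_len = max(len(s) for s in list_of_strings)

# Default conversion: a contiguous fixed-width string array in which
# every item is padded to the length of the longest string.
fixed = np.asarray(list_of_strings)

# Object array: only references to the original Python strings are stored.
boxed = np.asarray(list_of_strings, dtype=np.object_)

print("fixed-width dtype:", fixed.dtype)
print("fixed-width buffer (bytes):", fixed.nbytes)
print("object array buffer (bytes):", boxed.nbytes)
```

Note that boxed.nbytes only counts the references; the string objects themselves live outside the array, so the object array cannot be placed in shared memory the way the fixed-width buffer can.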
