2012/9/11 Olivier Grisel <[email protected]>:
>
> Anyway this simulation is probably not representative of your scenario
> as integers can be unboxed in the array datastructure hence the copy
> while string objects cannot and the array will only store references
> to the original string objects.

Actually I was wrong: np.asarray(list_of_strings) allocates a
contiguous, fixed-width numpy array with room for n_samples *
max(len(s) for s in list_of_strings) characters.
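
A minimal sketch to check this, assuming Python 3 where str is unicode
and numpy stores 4 bytes per character (with Python 2 byte strings it
would be 1 byte per character). The sample list is made up for
illustration:

```python
import numpy as np

# Hypothetical sample: two short strings and one long outlier.
list_of_strings = ["a", "bb", "c" * 1000]
arr = np.asarray(list_of_strings)

n_samples = len(list_of_strings)
max_len = max(len(s) for s in list_of_strings)

# The dtype is a fixed-width unicode type: every slot is as wide as
# the longest string, whatever the length of the string stored in it.
print(arr.dtype.kind)   # 'U' for unicode
print(arr.itemsize)     # bytes per slot: 4 * max_len
print(arr.nbytes)       # total buffer: n_samples * itemsize
```

So the buffer grows with the longest string, not with the total
number of characters in the list.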

On the plus side: this would make it possible to use shared memory if
the array conversion is done before the parallel call, and hence to
benefit from https://github.com/joblib/joblib/pull/44

On the minus side, if the longest string in the list is much longer
than the median (which occurs frequently in practice), this
representation wastes a lot of memory, hence the observed memory
explosion. We should probably set dtype=np.object_ to avoid unboxing
the strings in the cross validation code and maybe in some parts of
the vectorizer code as well.
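
To illustrate the difference, here is a rough comparison on a made-up
corpus (the names docs, fixed and boxed are only for this example, and
the sizes assume Python 3 unicode strings):

```python
import numpy as np

# Hypothetical corpus: many short documents and a single long one.
docs = ["short"] * 999 + ["x" * 10000]

# Default conversion: fixed-width slots sized for the longest string.
fixed = np.asarray(docs)

# Object dtype: the array only stores references to the str objects.
boxed = np.asarray(docs, dtype=np.object_)

print(fixed.nbytes)  # len(docs) * 4 * 10000 bytes
print(boxed.nbytes)  # len(docs) * pointer size

# No copy of the string data: the elements are the original objects.
print(boxed[-1] is docs[-1])
```

Note that boxed.nbytes only counts the references, not the strings
themselves, but those already exist in the original list anyway.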

I will open an issue to track this.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general