Hi,

I'm working on a text classification problem, and the strategy I'm
currently studying is based on this example:

http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html

When I replace the data component with my own, the memory requirement
explodes in a spectacular way, whereas the same problem run outside of
the GridSearchCV framework works fine, i.e. well within my memory
limit. At first I suspected the parallelization mechanism, but after a
long debugging session, I finally narrowed the problem down to the
check_arrays function in the validation.py module.

The text data fed to the Vectorizer starts out as a list of strings,
and this function converts it to a NumPy array (with np.asarray).
Although this conversion looks rather innocuous, it seems that under
certain pathological conditions it does not behave as one would
expect.
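
For reference, here is a minimal sketch of what that conversion produces on a
tiny input. The list `docs` is a made-up stand-in, not my real data. When
np.asarray is given a list of byte strings, it builds a fixed-width array, and
the width of every slot is the length of the longest element:

```python
import numpy as np

# Two hypothetical "documents" of very different lengths
docs = [b"short", b"a much longer document body"]

arr = np.asarray(docs)

# The dtype is a fixed-width bytes type; every element gets the same slot size
print(arr.dtype)
print(arr.itemsize)  # bytes reserved per element
print(arr.nbytes)    # total = number of elements * itemsize
```

So a short string costs as much array storage as the longest one.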

Here is a small program that demonstrates the problem by simulating
some textual data, once extracted:

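import os, random, resource
import numpy as np

# 30000 random byte strings of 50 to 20000 bytes each, standing in for extracted text
x = [os.urandom(random.randint(50, 20000)) for i in range(30000)]
# ru_maxrss is reported in kilobytes on Linux, so dividing by 1024 gives MB
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.)  # ~303MB
y = np.asarray(x)
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.)  # ~875MB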

It doesn't make sense that np.asarray should almost triple the memory
consumption, does it? (With my real data it's far worse, but I cannot
seem to replicate that with a simulation.)
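
A possible explanation, offered as a rough sketch rather than a confirmed
diagnosis: if np.asarray stores the strings in a fixed-width array, the array
needs (number of strings) * (longest string) bytes, on top of the original
list. The estimate below only does the arithmetic on simulated lengths; the
seed and variable names are arbitrary choices of mine. It also shows that
passing dtype=object keeps references to the existing strings instead of
copying them into padded slots:

```python
import random
import numpy as np

random.seed(0)  # arbitrary seed, only to make the estimate repeatable
lengths = [random.randint(50, 20000) for _ in range(30000)]

mb = 1024.0 ** 2
payload_mb = sum(lengths) / mb                     # bytes held by the strings themselves
fixed_width_mb = len(lengths) * max(lengths) / mb  # every slot padded to the longest

print("strings: %.0f MB, fixed-width array: %.0f MB" % (payload_mb, fixed_width_mb))

# With dtype=object the array holds one reference per string, no padded copy
small = [b"x" * n for n in lengths[:100]]
as_object = np.asarray(small, dtype=object)
print(as_object.dtype, as_object.nbytes)
```

With lengths spread this widely, the padded array comes out at roughly twice
the size of the strings themselves, which is the same order as the jump seen
above. A dataset with a few very long documents among many short ones would
fare much worse.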

Thanks,

Christian

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general