2012/9/11 Christian Jauvin <[email protected]>:
> Hi,
>
> I'm working on a text classification problem, and the strategy I'm
> currently studying is based on this example:
>
> http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html
>
> When I replace the data component with my own, I have found that the
> memory requirement explodes in a very spectacular way (whereas the
> same problem, outside of the GridSearchCV framework, works fine,
> i.e. well within my memory limit). At first I suspected the
> parallelization mechanism

Yes, the current parallelization mechanism used when n_jobs != 1
triggers memory copies of the input data for each subprocess. Work is
underway to fix that by sharing memory between subprocesses for
numerical input data: https://github.com/joblib/joblib/pull/44
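In the meantime, if your features are purely numerical, a manual
variant of the same idea is to pass a filename around and memory-map
the array inside each worker. This is only an illustrative sketch
(the paths, the toy data and the column_mean helper are made up),
not what GridSearchCV does out of the box:

import numpy as np
from joblib import Parallel, delayed

def column_mean(path, j):
    # Each worker re-opens the file as a read-only memory map, so the
    # array is paged in from the OS cache instead of being pickled and
    # copied into every subprocess.
    X = np.load(path, mmap_mode='r')
    return X[:, j].mean()

if __name__ == '__main__':
    X = np.random.rand(1000, 100)  # stand-in for your numerical data
    np.save('/tmp/X.npy', X)
    means = Parallel(n_jobs=2)(
        delayed(column_mean)('/tmp/X.npy', j) for j in range(X.shape[1]))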

However, this might not work as expected when the original input is a
list of text documents: the replication will still occur, as there is
no easy way to allocate a list of strings in shared memory a
posteriori.

For this case I would recommend passing the list of filenames as input
to CountVectorizer (by setting `input='filename'` in the constructor)
and changing your code to avoid loading the data into memory ahead of
time:

http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
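For example (with made-up file paths), something along these lines
keeps only the filenames in memory and lets the vectorizer read each
document lazily during fitting:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical paths: only the names are held in memory, the file
# contents are read one document at a time by fit_transform.
filenames = ['data/doc_00001.txt', 'data/doc_00002.txt']

vectorizer = CountVectorizer(input='filename')
X = vectorizer.fit_transform(filenames)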

> The textual data fed to the Vectorizer being initially a list of
> strings, it gets converted to a Numpy array (with np.asarray) in this
> function. Although this conversion looks rather innocuous, it seems
> that in certain pathological conditions it does not behave as one
> would expect.
>
> Here is a small program that demonstrates the problem by simulating
> some textual data, once extracted:
>
> import os, random, resource, numpy as np
> x = [os.urandom(random.randint(50, 20000)) for i in range(30000)]
> print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024. # ~303MB
> y = np.asarray(x)
> print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024. # ~875MB

Well, you still have the original x in memory, so you should at least
expect a doubling of the memory usage. The remaining memory might be
temporary objects allocated during the conversion, although that seems
odd. To rule that out, you can add:

del x
import gc
gc.collect()

Anyway, this simulation is probably not representative of your
scenario: integers can be unboxed in the array data structure (hence
the copy), whereas string objects cannot, so the array will only store
references to the original string objects.
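A tiny toy check of that distinction (using dtype=object to make the
reference holding explicit) is to inspect the dtypes and object
identity:

import numpy as np

ints = np.asarray([1, 2, 3])
print(ints.dtype)          # e.g. int64: values are unboxed into the array buffer

docs = ['foo', 'barbaz']
refs = np.asarray(docs, dtype=object)
print(refs.dtype)          # object: the array only stores pointers
print(refs[0] is docs[0])  # True: the original string is not copied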

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
