2012/9/11 Christian Jauvin <[email protected]>:
> Hi,
>
> I'm working on a text classification problem, and the strategy I'm
> currently studying is based on this example:
>
> http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html
>
> When I replace the data component with my own, I have found that the
> memory requirement explodes in a very spectacular way (whereas the
> same problem, outside of the GridSearchCV framework, works just fine,
> i.e. well within my memory limit). At first I suspected the
> parallelization mechanism
Yes, the current parallelization mechanism used when n_jobs != 1 triggers
memory copies for each subprocess. Work is underway to fix that by sharing
memory between subprocesses for numerical input data in
https://github.com/joblib/joblib/pull/44.

However, this might not work as expected when the original input is a list
of text documents: the replication will still occur, as there is no easy way
to allocate a list of strings in shared memory a posteriori. For this case I
would recommend passing the list of filenames as input to CountVectorizer,
by passing `input='filename'` to the constructor and changing your code to
avoid loading the data into memory ahead of time:

http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

> The textual data fed to the Vectorizer is initially a list of strings;
> it gets converted to a Numpy array (with np.asarray) in this function.
> Although this conversion looks rather innocuous, it seems that in certain
> pathological conditions it does not behave as one would expect.
>
> Here is a small program that demonstrates the problem by simulating
> some textual data, once extracted:
>
> import os, random, resource, numpy as np
> x = [os.urandom(random.randint(50, 20000)) for i in range(30000)]
> print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.  # ~303MB
> y = np.asarray(x)
> print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.  # ~875MB

Well, you still have the original x in memory, so you should at least expect
a doubling of the memory usage. The remaining memory might be temporary stuff
allocated during the conversion, although that seems weird. You can add the
following to release the original list and re-check the measurement:

del x
import gc
gc.collect()

Anyway, this simulation is probably not representative of your scenario:
integers can be unboxed into the array data structure (hence the copy),
while string objects cannot, so the array will only store references to
the original string objects.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
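For illustration, here is a minimal sketch of the filename-based approach
suggested above. The `text_files/*.txt` corpus layout is hypothetical;
the point is that only the short list of paths gets passed around (and
copied to the n_jobs subprocesses), while the raw documents are read from
disk by the vectorizer itself:

import glob

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus layout: collect the file paths only, the
# document contents stay on disk for now.
filenames = sorted(glob.glob('text_files/*.txt'))

# input='filename' tells the vectorizer to open and read each path
# itself, one document at a time, instead of expecting in-memory strings.
vectorizer = CountVectorizer(input='filename')
X = vectorizer.fit_transform(filenames)  # documents are read from disk here
print(X.shape)

The same list of filenames can then be passed as X to GridSearchCV over a
Pipeline that starts with such a vectorizer, so each worker only reads the
documents of its own folds from disk.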
