Hi, I'm working on a text classification problem, and the strategy I'm currently studying is based on this example:
http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html

When I replace the data component with my own, the memory requirement explodes in a very spectacular way (whereas the same problem, outside of the GridSearchCV framework, works fine, i.e. well within my memory limit). At first I suspected the parallelization mechanism, but after a long debugging session I finally narrowed the problem down to the check_arrays function in the validation.py module. The textual data fed to the Vectorizer is initially a list of strings, and this function converts it to a NumPy array (with np.asarray). Although this conversion looks rather innocuous, under certain pathological conditions it does not behave as one would expect.

Here is a small program that demonstrates the problem by simulating some textual data, once extracted:

    import os, random, resource
    import numpy as np

    # Simulate 30000 documents of wildly varying length.
    x = [os.urandom(random.randint(50, 20000)) for i in range(30000)]
    print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.  # ~303 MB

    # Converting the list to an array almost triples peak memory usage.
    y = np.asarray(x)
    print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.  # ~875 MB

It doesn't make sense that np.asarray should almost triple the memory consumption, does it? (With my real data it's far worse, but I cannot seem to replicate that with a simulation.)

Thanks,
Christian

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
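A plausible explanation for the blow-up (a sketch, not a confirmed diagnosis of what check_arrays triggers here): when np.asarray is given a list of byte strings, NumPy promotes it to a fixed-width 'S' dtype sized to the *longest* element, so every row is padded to that maximum width. Passing dtype=object instead stores one pointer per string and avoids the padding; whether that is an appropriate fix inside scikit-learn is an open question. A minimal Python 3 demonstration:

```python
import numpy as np

# Two byte strings of very different lengths, mimicking variable-length documents.
docs = [b"a" * 10, b"b" * 20000]

arr = np.asarray(docs)
# NumPy picks a fixed-width byte-string dtype sized to the longest element,
# so every row occupies 20000 bytes regardless of its actual length.
print(arr.dtype)   # |S20000
print(arr.nbytes)  # 40000 bytes stored for ~20010 bytes of data

# Keeping the elements as Python objects avoids the padding: each cell is
# just a pointer back to the original string.
obj_arr = np.asarray(docs, dtype=object)
print(obj_arr.dtype)  # object
```

With 30000 documents padded to 20000 bytes each, that is roughly 600 MB for the array alone, which is about the jump the program above reports.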
