Dear all, I was wondering if somebody could advise on the best way to generate and store large sparse feature sets that do not fit in memory? In particular, my workflow is the following:
Large text dataset -> HashingVectorizer -> feature set as a sparse CSR array on disk -> classifier training -> predictions

The generated feature set is too large to fit in RAM; however, the classifier training can be done in one step (it uses only certain rows of the CSR array), and the prediction can be split into several steps, each of which fits in memory. Since the training can be performed in one step, I'm not looking for incremental out-of-core learning approaches, and saving the features to disk for later processing is definitely useful.

For instance, if it were possible to save the output of the HashingVectorizer to a single file on disk (using e.g. joblib.dump) and then load that file as a memory map (using e.g. joblib.load(..., mmap_mode='r')), everything would work great. Due to the memory constraints this cannot be done directly, and the best I can do is apply the HashingVectorizer to chunks of the dataset, which produces a series of sparse CSR arrays on disk. Then,

- concatenating these arrays into a single CSR array appears to be non-trivial given the memory constraints (e.g. scipy.sparse.vstack converts all arrays to the COO sparse representation internally);

- I was not able to find an abstraction layer that would represent these sparse arrays as a single array. For instance, dask can do this for dense arrays ( http://dask.pydata.org/en/latest/array-stack.html ), but support for sparse arrays is only planned at this point ( https://github.com/dask/dask/issues/174 ).

Finally, it is not possible to pre-allocate the full array on disk in advance (and access it as a memory map), because the number of non-zero elements in the sparse array is not known before running the feature extraction. Of course, all of these difficulties could be overcome by using a machine with more memory, but my point is rather to have a memory-efficient workflow.
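For concreteness, the chunked vectorization step described above could be sketched as follows. This is only an illustration, not a prescribed API: the function name `vectorize_in_chunks`, the file-naming scheme, and the chunk size are all assumptions. HashingVectorizer is stateless, which is what makes per-chunk transformation safe.

```python
# Sketch (hypothetical helper): vectorize a large corpus chunk by chunk
# and dump each chunk's CSR matrix to disk with joblib.
import joblib
from sklearn.feature_extraction.text import HashingVectorizer


def vectorize_in_chunks(documents, chunk_size, out_prefix, n_features=2**20):
    """Transform `documents` (any iterable of strings) in chunks of
    `chunk_size`, dumping one CSR matrix per chunk.  Returns the list
    of file paths written."""
    # HashingVectorizer is stateless (no fit), so transforming each
    # chunk independently yields consistent feature indices.
    vectorizer = HashingVectorizer(n_features=n_features)
    paths, chunk = [], []

    def flush():
        X = vectorizer.transform(chunk)          # CSR matrix for this chunk
        path = "%s_%04d.joblib" % (out_prefix, len(paths))
        joblib.dump(X, path)
        paths.append(path)
        chunk.clear()

    for doc in documents:
        chunk.append(doc)
        if len(chunk) == chunk_size:
            flush()
    if chunk:                                    # trailing partial chunk
        flush()
    return paths
```

Each saved chunk can later be reloaded lazily with joblib.load(path, mmap_mode='r'), so that only the rows actually touched are paged into memory.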
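On the concatenation point: once the chunks are on disk, the total number of non-zero elements is known, so the pre-allocation problem disappears. A CSR row-stack only needs the per-chunk data and indices arrays concatenated and the indptr arrays offset by the cumulative nnz, which avoids scipy.sparse.vstack's COO round-trip entirely. A minimal sketch, assuming the chunks can be loaded one at a time (the function name, the raw memmap file layout, and the ".dat" suffixes are all hypothetical):

```python
# Sketch (hypothetical helper): stack CSR chunks into one CSR matrix
# whose data/indices arrays live in memory-mapped files on disk,
# without converting anything to COO.
import numpy as np
import scipy.sparse as sp


def stack_csr_chunks(chunks, out_prefix):
    """`chunks`: iterable of scipy CSR matrices with the same number of
    columns (e.g. loaded one at a time with joblib.load).  Returns a
    CSR matrix backed by memmapped data/indices arrays."""
    chunks = list(chunks)
    n_rows = sum(c.shape[0] for c in chunks)
    n_cols = chunks[0].shape[1]
    nnz = sum(c.nnz for c in chunks)

    # Raw memmaps on disk; only one chunk's worth of source data is in
    # RAM at a time.  indptr is small (n_rows + 1) and kept in memory.
    data = np.memmap(out_prefix + ".data.dat", mode="w+",
                     dtype=chunks[0].data.dtype, shape=(nnz,))
    indices = np.memmap(out_prefix + ".indices.dat", mode="w+",
                        dtype=chunks[0].indices.dtype, shape=(nnz,))
    indptr = np.zeros(n_rows + 1, dtype=np.int64)

    nnz_offset = row_offset = 0
    for c in chunks:
        data[nnz_offset:nnz_offset + c.nnz] = c.data
        indices[nnz_offset:nnz_offset + c.nnz] = c.indices
        # Shift this chunk's row pointers by the nnz written so far.
        indptr[row_offset + 1:row_offset + c.shape[0] + 1] = \
            c.indptr[1:].astype(np.int64) + nnz_offset
        nnz_offset += c.nnz
        row_offset += c.shape[0]

    # Caveat: scipy may cast the index arrays to a common dtype here,
    # which can copy them into RAM if the dtypes don't already agree.
    return sp.csr_matrix((data, indices, indptr), shape=(n_rows, n_cols))
```

This is a two-pass scheme (one pass to size the files, one to fill them); row slicing of the result, e.g. X[train_rows], then reads only the needed pages from disk.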
I would really appreciate any advice on this, and would be happy to contribute to a project in the scikit-learn environment aiming to address similar issues.

Thank you,
Best,
--
Roman
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
