> - concatenation of these arrays into a single CSR array appears to be
> non-trivial given the memory constraints (e.g. scipy.sparse.vstack
> transforms all arrays to COO sparse representation internally).
There is a fast path for stacking a series of CSR matrices (see the
sketch at the end of this message).

On 6 June 2016 at 22:19, Roman Yurchak <[email protected]> wrote:
> Dear all,
>
> I was wondering if somebody could advise on the best way to generate
> and store large sparse feature sets that do not fit in memory? In
> particular, I have the following workflow,
>
> Large text dataset -> HashingVectorizer -> Feature set in a sparse CSR
> array on disk -> Training a classifier -> Predictions
>
> where the generated feature set is too large to fit in RAM, but the
> classifier training can be done in one step (as it uses only certain
> rows of the CSR array) and the prediction can be split into several
> steps, all of which fit in memory. Since the training can be performed
> in one step, I'm not looking for incremental out-of-core learning
> approaches, and saving features to disk for later processing is
> definitely useful.
>
> For instance, if it were possible to save the output of the
> HashingVectorizer to a single file on disk (using e.g. joblib.dump)
> and then load this file as a memory map (using e.g. joblib.load(..,
> mmap_mode='r')), everything would work great. Due to memory
> constraints this cannot be done directly, and the best-case scenario
> is applying the HashingVectorizer to chunks of the dataset, which
> produces a series of sparse CSR arrays on disk. Then,
> - concatenation of these arrays into a single CSR array appears to be
> non-trivial given the memory constraints (e.g. scipy.sparse.vstack
> transforms all arrays to COO sparse representation internally).
> - I was not able to find an abstraction layer that would allow
> representing these sparse arrays as a single array. For instance, dask
> allows this for dense arrays
> (http://dask.pydata.org/en/latest/array-stack.html), but support for
> sparse arrays is only planned at this point
> (https://github.com/dask/dask/issues/174).
> Finally, it is not possible to pre-allocate the full array on disk in
> advance (and access it as a memory map) because we don't know the
> number of non-zero elements in the sparse array before running the
> feature extraction.
>
> Of course, it is possible to overcome all these difficulties by using
> a machine with more memory, but my point is rather to have a
> memory-efficient workflow.
>
> I would really appreciate any advice on this and would be happy to
> contribute to a project in the scikit-learn environment aiming to
> address similar issues.
>
> Thank you,
> Best,
> --
> Roman
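
To make that concrete, here is a minimal sketch of what the fast path
amounts to: for all-CSR inputs, vertical stacking only needs to
concatenate the data and indices arrays and shift each chunk's indptr
by the running number of stored elements, with no COO intermediate.
(If I remember correctly, recent scipy versions special-case all-CSR
input in vstack/bmat; stack_csr below is just a hypothetical helper
name for illustration.)

import numpy as np
import scipy.sparse as sp

def stack_csr(chunks):
    # Vertically stack a list of CSR matrices (all with the same
    # number of columns) without converting to COO: data and column
    # indices carry over unchanged, and each subsequent chunk's indptr
    # is offset by the cumulative count of stored elements.
    data = np.concatenate([c.data for c in chunks])
    indices = np.concatenate([c.indices for c in chunks])
    indptr_parts = [chunks[0].indptr]
    offset = chunks[0].indptr[-1]  # nnz of the first chunk
    for c in chunks[1:]:
        indptr_parts.append(c.indptr[1:] + offset)
        offset += c.indptr[-1]
    indptr = np.concatenate(indptr_parts)
    n_rows = sum(c.shape[0] for c in chunks)
    return sp.csr_matrix((data, indices, indptr),
                         shape=(n_rows, chunks[0].shape[1]))

# quick check against scipy's own vstack
a = sp.random(3, 5, density=0.4, format='csr')
b = sp.random(2, 5, density=0.4, format='csr')
assert (stack_csr([a, b]) != sp.vstack([a, b])).nnz == 0

The same index arithmetic should also work chunk-by-chunk against
files on disk (append each chunk's data and indices, accumulate the
shifted indptr), so the full matrix never has to live in memory at
once.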
