Dear all,

I was wondering if somebody could advise on the best way for
generating/storing large sparse feature sets that do not fit in memory?
In particular, I have the following workflow,

Large text dataset -> HashingVectorizer -> Feature set in a sparse CSR
array on disk -> Training a classifier -> Predictions

where the generated feature set is too large to fit in RAM. However, the
classifier training can be done in one step (as it uses only certain
rows of the CSR array), and the prediction can be split into several
steps, each of which fits in memory. Since the training can be performed
in one step, I'm not looking for incremental out-of-core learning
approaches, and saving the features to disk for later processing is
definitely useful.

For instance, if it was possible to save the output of the
HashingVectorizer to a single file on disk (using e.g. joblib.dump) then
load this file as a memory map (using e.g. joblib.load(..,
mmap_mode='r')) everything would work great. Due to memory constraints
this cannot be done directly, and the best case scenario is applying
HashingVectorizer on chunks of the dataset, which produces a series of
sparse CSR arrays on disk. Then,
 - concatenating these arrays into a single CSR array appears to be
non-trivial given the memory constraints (e.g. scipy.sparse.vstack
converts all arrays to the COO sparse representation internally).
 - I was not able to find an abstraction layer that would allow these
sparse arrays to be represented as a single array. For instance, dask
can do this for dense arrays (
http://dask.pydata.org/en/latest/array-stack.html ), but support for
sparse arrays is only planned at this point (
https://github.com/dask/dask/issues/174 ).
  Finally, it is not possible to pre-allocate the full array on disk in
advance (and access it as a memory map) because we don't know the number
of non-zero elements in the sparse array before running the feature
extraction.
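For what it's worth, the COO round-trip in scipy.sparse.vstack can be
avoided by stacking the CSR components directly: since all chunks
produced by the HashingVectorizer share the same number of columns,
their data and indices arrays can simply be concatenated, and only the
indptr arrays need an offset. A minimal sketch (the function name
vstack_csr is mine; in practice the np.concatenate calls could be
replaced by appends to pre-opened files or memory maps to keep the peak
memory proportional to one chunk):

```python
import numpy as np
import scipy.sparse as sp

def vstack_csr(chunks):
    """Stack CSR matrices with equal column counts without
    converting to COO (unlike scipy.sparse.vstack)."""
    # data and indices are just laid end to end in CSR order
    data = np.concatenate([c.data for c in chunks])
    indices = np.concatenate([c.indices for c in chunks])
    # indptr of each later chunk is shifted by the nnz seen so far
    indptr_parts = [chunks[0].indptr]
    offset = chunks[0].indptr[-1]
    for c in chunks[1:]:
        indptr_parts.append(c.indptr[1:] + offset)
        offset += c.indptr[-1]
    indptr = np.concatenate(indptr_parts)
    n_rows = sum(c.shape[0] for c in chunks)
    return sp.csr_matrix((data, indices, indptr),
                         shape=(n_rows, chunks[0].shape[1]))
```

This only reorganizes index arrays, so no dense or COO intermediate is
ever materialized, but of course it does not by itself solve the
"unknown nnz in advance" pre-allocation problem.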

  Of course, it is possible to overcome all these difficulties by using
a machine with more memory, but my point is rather to have a memory
efficient workflow.

  I would really appreciate any advice on this, and would be happy to
contribute to a project in the scikit-learn ecosystem that aims to
address similar issues.

Thank you,
Best,
-- 
Roman



_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
