[CONF] Apache Mahout > Collection(De-)Serialization
Isabel Drost-Fromm (Confluence) Sun, 22 Dec 2013 13:59:57 -0800
Collection(De-)Serialization 
Many "Big Data" datasets are very sparse.
Health data: most patients don't have most diseases.
NLP data: most documents don't contain all 15k+ common words.
Graph data: most graphs are not fully connected.
and so on...
Far FEWER datasets are very dense.
Images come to mind, but that is already well addressed by OpenCV.
Proposal: use a serialization/deserialization strategy that supports a sparse matrix representation. Sparse representations store only the non-zero entries, so they are efficient in both memory and computation. Here is an example of a sparse matrix implementation that is used very heavily for ML tasks:
http://www.mathworks.com/help/matlab/ref/sparse.html
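To make the proposal concrete, here is a minimal sketch of what sparse (de-)serialization could look like in Java. It is an illustration only and not Mahout's actual API: the class name SparseMatrixCodec, its two methods, and the byte layout (a header of rows, columns, and non-zero count, followed by one row/column/value triple per non-zero cell) are all assumptions made for this example. It also assumes a rectangular input array.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical helper (not part of Mahout): serializes only the non-zero
 * cells of a matrix as (row, column, value) triples.
 */
final class SparseMatrixCodec {

    private static final int HEADER_BYTES = 3 * Integer.BYTES;                 // rows, cols, nonZeros
    private static final int TRIPLE_BYTES = 2 * Integer.BYTES + Double.BYTES;  // row, col, value

    /** Writes the header, then one triple per non-zero cell. Assumes a rectangular array. */
    static byte[] serialize(double[][] matrix) {
        int rows = matrix.length;
        int cols = rows == 0 ? 0 : matrix[0].length;

        // First pass: count non-zero cells so the buffer can be sized exactly.
        int nonZeros = 0;
        for (double[] row : matrix) {
            for (double value : row) {
                if (value != 0.0) {
                    nonZeros++;
                }
            }
        }

        ByteBuffer buffer = ByteBuffer.allocate(HEADER_BYTES + nonZeros * TRIPLE_BYTES);
        buffer.putInt(rows).putInt(cols).putInt(nonZeros);

        // Second pass: emit only the non-zero cells.
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (matrix[r][c] != 0.0) {
                    buffer.putInt(r).putInt(c).putDouble(matrix[r][c]);
                }
            }
        }
        return buffer.array();
    }

    /** Reads the header and triples back; cells not listed stay at 0.0. */
    static double[][] deserialize(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        int rows = buffer.getInt();
        int cols = buffer.getInt();
        int nonZeros = buffer.getInt();

        double[][] matrix = new double[rows][cols];
        for (int i = 0; i < nonZeros; i++) {
            int r = buffer.getInt();
            int c = buffer.getInt();
            matrix[r][c] = buffer.getDouble();
        }
        return matrix;
    }
}
```

With this layout, a 100 x 100 matrix holding three non-zero values serializes to 12 + 3 * 16 = 60 bytes, compared with 80,000 bytes for the raw doubles. The sketch expands back into a dense array only to keep the round trip easy to check; a real implementation would presumably deserialize into a sparse in-memory structure as well.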