[CONF] Apache Mahout > Collection(De-)Serialization

Isabel Drost-Fromm (Confluence) Sun, 22 Dec 2013 13:59:57 -0800

	Isabel Drost-Fromm removed a comment from the page:
	Collection(De-)Serialization

Many "Big Data" datasets are very sparse.

Health data: most patients don't have most diseases.
NLP data: most documents don't contain all 15k+ common words.
Graph data: most graphs are not fully connected.
and so on...

Far FEWER datasets are very dense.
Images come to mind, but that is already well addressed by OpenCV.

Proposal: use a serialization/deserialization strategy that supports a sparse matrix representation. Sparse representations are efficient in both memory and computation, because only the non-zero entries are stored and processed. Here is an example of a sparse matrix implementation that is used very heavily for ML tasks:

http://www.mathworks.com/help/matlab/ref/sparse.html
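
To make the proposal concrete, here is a minimal sketch of what a sparse round trip could look like, assuming a simple coordinate (row, column, value) layout. The function names and the byte layout are hypothetical and chosen only for illustration. This is not Mahout's actual wire format, nor the format MATLAB uses internally.

```python
import struct

# Hypothetical layout: a header of three unsigned 32-bit ints
# (rows, cols, number of stored entries), followed by one
# (row, col, value) record per non-zero entry.
HEADER = "<III"
RECORD = "<IId"


def serialize_sparse(rows, cols, entries):
    """Pack a sparse matrix given as {(row, col): value} into bytes."""
    chunks = [struct.pack(HEADER, rows, cols, len(entries))]
    # Sort by (row, col) so the output is deterministic.
    for (r, c), value in sorted(entries.items()):
        chunks.append(struct.pack(RECORD, r, c, value))
    return b"".join(chunks)


def deserialize_sparse(data):
    """Inverse of serialize_sparse: returns (rows, cols, entries)."""
    rows, cols, count = struct.unpack_from(HEADER, data, 0)
    offset = struct.calcsize(HEADER)
    step = struct.calcsize(RECORD)
    entries = {}
    for _ in range(count):
        r, c, value = struct.unpack_from(RECORD, data, offset)
        entries[(r, c)] = value
        offset += step
    return rows, cols, entries


if __name__ == "__main__":
    matrix = {(0, 5): 1.0, (42, 7): 2.5, (999, 999): -3.0}
    blob = serialize_sparse(1000, 1000, matrix)
    print(len(blob), "bytes for", len(matrix), "non-zero entries")
    assert deserialize_sparse(blob) == (1000, 1000, matrix)
```

With this layout, a 1000 x 1000 matrix holding three non-zero values takes 60 bytes (a 12-byte header plus three 16-byte records), while the same matrix stored densely as 8-byte doubles would take 8,000,000 bytes. The saving shrinks as the matrix fills up, so a real implementation would likely want a dense fallback.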
