[CONF] Apache Mahout > Collection(De-)Serialization

confluence Fri, 29 Mar 2013 14:29:26 -0700

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Collection(De-)Serialization 
(https://cwiki.apache.org/confluence/display/MAHOUT/Collection%28De-%29Serialization)
Comment: 
https://cwiki.apache.org/confluence/display/MAHOUT/Collection%28De-%29Serialization?focusedCommentId=30759975#comment-30759975


Comment added by Andrew McMurry:
---------------------------------------------------------------------

Many "Big Data" datasets are very sparse. 

Health data: most patients don't have most diseases. 
NLP data: most documents don't contain all 15k+ common words. 
Graph data: most graphs are not fully connected. 
and so on...

Far FEWER datasets are very dense. 
Images come to mind, but that case is already well addressed by OpenCV. 

Proposal: use a serialization/deserialization strategy that allows for sparse 
matrix representation. Sparse representations are efficient in both memory and 
computation, because only the nonzero entries are stored and processed. Here 
is an example of a sparse matrix implementation that is used very heavily for 
ML tasks:  

http://www.mathworks.com/help/matlab/ref/sparse.html
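
To make the proposal concrete, here is a rough sketch of what sparse 
(de-)serialization could look like. It is not Mahout's API or wire format: the 
function names and the byte layout (a small header followed by one 
(row, col, value) record per nonzero entry) are assumptions made up for 
illustration, and it is written in Python only to keep it short.

```python
import struct

# Hypothetical layout, chosen for this sketch only:
#   header : rows, cols, nnz as three little-endian unsigned 32-bit ints
#   records: one (row, col, value) triple per nonzero entry
HEADER = struct.Struct("<III")
RECORD = struct.Struct("<IId")


def serialize_sparse(rows, cols, entries):
    """Pack a sparse matrix given as {(row, col): value} into bytes.

    Entries whose value is zero are dropped, so the output size depends
    on the number of nonzeros rather than on rows * cols.
    """
    nonzero = sorted((rc, v) for rc, v in entries.items() if v != 0.0)
    parts = [HEADER.pack(rows, cols, len(nonzero))]
    for (r, c), v in nonzero:
        parts.append(RECORD.pack(r, c, v))
    return b"".join(parts)


def deserialize_sparse(data):
    """Inverse of serialize_sparse: returns (rows, cols, entries)."""
    rows, cols, nnz = HEADER.unpack_from(data, 0)
    offset = HEADER.size
    entries = {}
    for _ in range(nnz):
        r, c, v = RECORD.unpack_from(data, offset)
        entries[(r, c)] = v
        offset += RECORD.size
    return rows, cols, entries
```

With this layout, a 1000 x 15000 document-term matrix holding three nonzero 
counts takes 12 + 3 * 16 = 60 bytes, whereas writing every cell as an 8-byte 
double would take 1000 * 15000 * 8 = 120,000,000 bytes.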

