Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Collection(De-)Serialization (https://cwiki.apache.org/confluence/display/MAHOUT/Collection%28De-%29Serialization)
Comment: https://cwiki.apache.org/confluence/display/MAHOUT/Collection%28De-%29Serialization?focusedCommentId=30759975#comment-30759975
Comment added by Andrew McMurry:
---------------------------------------------------------------------
Many "Big Data" datasets are very sparse.
Health data: most patients don't have most diseases.
NLP data: most documents contain only a fraction of the 15k+ common words.
Graph data: most graphs are not fully connected.
And so on...

Far FEWER datasets are very dense. Images come to mind, but that case is already well addressed by OpenCV.

Proposal: use a serialization/deserialization strategy that allows for a sparse matrix representation. Sparse representations store only the nonzero entries, so they are efficient in both memory and computation.

Here is an example of a sparse matrix implementation that is used very heavily for ML tasks:
http://www.mathworks.com/help/matlab/ref/sparse.html
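
To make the proposal concrete, here is a minimal sketch of what a sparse serialization could look like. It is only an illustration, not Mahout's format or API: the function names `serialize_sparse` and `deserialize_sparse` and the byte layout (a length, a count of stored entries, then index/value pairs) are assumptions chosen for this example.

```python
import struct

# Hypothetical layout (an assumption for illustration, not Mahout's format):
#   4 bytes  total vector length      (unsigned int, big-endian)
#   4 bytes  number of stored entries (unsigned int, big-endian)
#   then, per stored entry: 4-byte index + 8-byte double value

def serialize_sparse(values):
    """Encode a list of floats, writing only the nonzero entries."""
    nonzero = [(i, v) for i, v in enumerate(values) if v != 0.0]
    chunks = [struct.pack(">II", len(values), len(nonzero))]
    for index, value in nonzero:
        chunks.append(struct.pack(">Id", index, value))
    return b"".join(chunks)

def deserialize_sparse(data):
    """Rebuild the full list of floats from the sparse encoding."""
    length, count = struct.unpack_from(">II", data, 0)
    values = [0.0] * length
    offset = 8  # size of the two-integer header
    for _ in range(count):
        index, value = struct.unpack_from(">Id", data, offset)
        values[index] = value
        offset += 12  # 4-byte index + 8-byte double
    return values
```

With this layout, a vector of 1000 doubles with 3 nonzero entries takes 8 + 3 * 12 = 44 bytes, compared with 8000 bytes if every double were written out. The trade-off is that a mostly dense vector would grow, since each stored entry carries an extra 4-byte index.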
