The Apache Mahout PMC is pleased to announce the release of Mahout 0.8. Mahout's goal is to build scalable machine learning libraries focused primarily on collaborative filtering (recommenders), clustering and classification (known collectively as the "3Cs"). The project also provides the infrastructure needed to support those implementations, including, but not limited to, math packages for statistics, linear algebra and other areas; Java primitive collections; local and distributed vector and matrix classes; and a variety of integrative code to work with popular packages such as Apache Hadoop, Apache Lucene, Apache HBase, Apache Cassandra and more. The 0.8 release is mainly a cleanup release in preparation for an upcoming 1.0 release, but it includes several significant new features, which are highlighted below.

To get started with Apache Mahout 0.8, download the release artifacts and signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the central Maven repository.

In addition to the release highlights and artifacts, please pay attention to the section labelled FUTURE PLANS below for more information about upcoming releases of Mahout.

As with any release, we wish to thank all of the users and contributors to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for individual credits, as there are too many to list here.

GETTING STARTED

In the release package, the examples directory contains several working examples of the core functionality available in Mahout. These can be run via scripts in the examples/bin directory, which will prompt you for more information to help you try things out. Most examples do not need a Hadoop cluster in order to run.

RELEASE HIGHLIGHTS

The highlights of the Apache Mahout 0.8 release include, but are not limited to, the list below. For further information, see the included CHANGELOG file.

- Numerous performance improvements to Vector and Matrix implementations, APIs and their iterators (see also MAHOUT-1192, MAHOUT-1202)
- Numerous performance improvements to the recommender implementations (see also MAHOUT-1272, MAHOUT-1035, MAHOUT-1042, MAHOUT-1151, MAHOUT-1166, MAHOUT-1167, MAHOUT-1169, MAHOUT-1205, MAHOUT-1264)
- MAHOUT-1088: Support for biased item-based recommender
- MAHOUT-1089: SGD matrix factorization for rating prediction with user and item biases
- MAHOUT-1106: Support for SVD++
- MAHOUT-944: Support for converting one or more Lucene storage indexes to SequenceFiles as well as an upgrade of the supported Lucene version to Lucene 4.3.1.
- MAHOUT-1154 and friends: New streaming k-means implementation that offers on-line (and fast) clustering
- MAHOUT-833: Make conversion to SequenceFiles Map-Reduce; 'seqdirectory' can now be run as a MapReduce job.
- MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values).
- MAHOUT-884: Matrix Concat utility; presently it only concatenates two matrices.
- MAHOUT-1244: Upgraded to use Lucene 4.3
- MAHOUT-1187: Upgraded to CommonsLang3
- MAHOUT-916: Speed up the Mahout build by making tests run in parallel.
- The usual bug fixes. See JIRA [2] for more information on the 0.8 release.

A total of 218 separate JIRA issues are addressed in this release.

CONTRIBUTING

Mahout is always looking for contributions focused on the 3Cs. If you are interested in contributing, please see our contribution page, https://cwiki.apache.org/MAHOUT/how-to-contribute.html, on the Mahout wiki or contact us via email at d...@mahout.apache.org.

FUTURE PLANS

0.9

As the project moves towards a 1.0 release, the community is working to clean up and/or remove parts of the code base that are under-supported or that underperform, and to better focus its energy and contributions on key algorithms that are proven to scale in production and have seen widespread adoption. To this end, the project plans to remove support for the following algorithms in the next release unless they receive sustained support and improvement before then.

The algorithms to be removed are:

- From Clustering:
    Dirichlet
    MeanShift
    MinHash
    Eigencuts
- From Classification (both are sequential implementations)
    Winnow
    Perceptron
- Frequent Pattern Mining
- Collaborative Filtering
    All recommenders in org.apache.mahout.cf.taste.impl.recommender.knn
    SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone
    Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
    TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender
- Mahout Math
    Lanczos in favour of SSVD
    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy

If you are interested in supporting one or more of these algorithms, please make it known on d...@mahout.apache.org and via JIRA issues that fix and/or improve them. Please also provide supporting evidence as to their effectiveness for you in production.

1.0 PLANS

Our plans as a community are to focus 0.9 on cleanup of bugs and the removal of the code mentioned above, and then to follow with a 1.0 release soon thereafter. At that point the community is committing to support the algorithms packaged in 1.0 for at least two minor versions after their release. In the case of removal after 1.0, we will deprecate the functionality in the 1.(x+1) minor release and remove it in the 1.(x+2) release. For instance, if feature X is to be removed after the 1.2 release, it will be deprecated in 1.3 and removed in 1.4.

[1] http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?revision=1501110&view=markup
[2] https://issues.apache.org/jira/issues/?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.8%22