mahout git commit: MAHOUT-1853: Add new thresholds and partitioning methods to SimilarityAnalysis

pat Tue, 13 Sep 2016 13:02:55 -0700

Repository: mahout
Updated Branches:
  refs/heads/master 3351b75b3 -> b5fe4aab2



MAHOUT-1853: Add new thresholds and partitioning methods to SimilarityAnalysis


Project: http://git-wip-us.apache.org/repos/asf/mahout/repo
Commit: http://git-wip-us.apache.org/repos/asf/mahout/commit/b5fe4aab
Tree: http://git-wip-us.apache.org/repos/asf/mahout/tree/b5fe4aab
Diff: http://git-wip-us.apache.org/repos/asf/mahout/diff/b5fe4aab

Branch: refs/heads/master
Commit: b5fe4aab22e7867ae057a6cdb1610cfa17555311
Parents: 3351b75
Author: pferrel <[email protected]>
Authored: Tue Sep 13 13:02:14 2016 -0700
Committer: pferrel <[email protected]>
Committed: Tue Sep 13 13:02:14 2016 -0700

----------------------------------------------------------------------
 CHANGELOG                                       | 627 -------------------
 .../mahout/math/cf/SimilarityAnalysis.scala     | 192 +++++-
 .../mahout/cf/SimilarityAnalysisSuite.scala     | 125 +++-
 3 files changed, 272 insertions(+), 672 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mahout/blob/b5fe4aab/CHANGELOG
----------------------------------------------------------------------
diff --git a/CHANGELOG b/CHANGELOG
deleted file mode 100644
index 5cd8af5..0000000
--- a/CHANGELOG
+++ /dev/null
@@ -1,627 +0,0 @@
-Mahout Change Log
-
-Release 0.12.0 - unreleased
-
-  MAHOUT-1775: FileNotFoundException caused by aborting the process of 
downloading Wikipedia dataset (Bowei Zhang via smarthi)
-
-  MAHOUT-1771: Cluster dumper omits indices and 0 elements for dense vector or 
sparse containing 0s (srowen)
-
-  MAHOUT-1613: classifier.df.tools.Describe does not handle -D parameters 
(haohui mai via smarthi)
-
-  MAHOUT-1642: Iterator class within SimilarItems class always misses the 
first element (Oleg Zotov via smarthi)
-
-  MAHOUT-1675: Remove MLP from codebase (ZJaffe via smarthi)
-
-Release 0.11.0 - 2015-08-07
-
-  MAHOUT-1744: Deprecate lucene2seq (apalumbo)
-
-  MAHOUT-1761: Upgraded to Apache parent pom v17 (sslavic)
-
-  MAHOUT-1745: Purge deprecated ConcatVectorsJob from codebase (apalumbo)
-
-  MAHOUT-1757: small fix in spca formula (smarthi)
-
-  MAHOUT-1756: Missing +=: and *=: operators on vectors (smarthi)
-
-  NOJIRA: Clean up CLI help for spark-rowsimilarity and fixed test that 
intermitently failed (pferrel)
-
-  MAHOUT-1685: Move Mahout shell to Spark 1.3+ (dlyubimov, apalumbo)
-
-  MAHOUT-1653: Spark 1.3 (pferrel, apalumbo)
-
-  MAHOUT-1754: Distance and squared distance matrices routines (dlyubimov)
-    
-  MAHOUT-1753: First and second moment routines (dlyubimov)
-    
-  MAHOUT-1746: mxA ^ 2, mxA ^ 0.5 to mean the same thing as mxA * mxA and mxA 
::= sqrt _ (dlyubimov)
-
-  MAHOUT-1736: Implement allreduceBlock() on H2O (avati)
-    
-  MAHOUT-1752: Implement CbindScalar operator on H2O (avati)
-  
-  MAHOUT-1660: Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf 
(dlyubimov)
-
-  MAHOUT-1713: Performance and parallelization improvements for AB', A'B, A'A 
spark physical operators (dlyubimov)
-
-  MAHOUT-1714: Add MAHOUT_OPTS environment when running Spark shell (dlyubimov)
-
-  MAHOUT-1715: Closeable API for broadcast tensors (dlyubimov)
-
-  MAHOUT-1716: Scala logging style (dlyubimov)
-
-  MAHOUT-1717: allreduceBlock() operator api and Spark implementation 
(dlyubimov)
-
-  MAHOUT-1718: Support for conversion of any type-keyed DRM into 
ordinally-keyed DRM (dlyubimov)
-
-  MAHOUT-1719: Unary elementwise function operator and function fusions 
(dlyubimov)
-
-  MAHOUT-1720: Support 1 cbind X, X cbind 1 etc. for both Matrix and DRM 
(dlyubimov)
-
-  MAHOUT-1721: rowSumsMap() summary for non-int-keyed DRMs (dlyubimov)
-
-  MAHOUT-1722: DRM row sampling api (dlyubimov)
-
-  MAHOUT-1723: Optional structural "flavor" abstraction for in-core matrices 
(dlyubimov)
-
-  MAHOUT-1724: Optimizations of matrix-matrix in-core multiplication based on 
structural flavors (dlyubimov)
-
-  MAHOUT-1725: elementwise power operator ^ (dlyubimov)
-
-  MAHOUT-1726: R-like vector concatenation operator (dlyubimov)
-
-  MAHOUT-1727: Elementwise analogues of scala.math functions for tensor types 
(dlyubimov)
-
-  MAHOUT-1728: In-core functional assignments (dlyubimov)
-
-  MAHOUT-1729: Straighten out behavior of Matrix.iterator() and 
iterateNonEmpty() (dlyubimov)
-
-  MAHOUT-1730: New mutable transposition view for in-core matrices (dlyubimov)
-
-  MAHOUT-1731: Deprecate SparseColumnMatrix (dlyubimov)
-
-  MAHOUT-1732: Native support for kryo serialization of tensor types 
(dlyubimov)
-
-Release 0.10.1 - 2015-05-31
-
-  MAHOUT-1704: Pare down dependency jar for h2o (apalumbo)
-
-  MAHOUT-1697: Fixed paths to which math-scala and spark modules docs get 
packaged under in bin distribution archive (sslavic)
-
-  MAHOUT-1696: QRDecomposition.solve(...) can return incorrect Matrix types 
(apalumbo)
-
-  MAHOUT-1690: CLONE - Some vector dumper flags are expecting arguments. 
(smarthi)
-
-  MAHOUT-1693: FunctionalMatrixView materializes row vectors in scala shell 
(apalumbo)
-
-  MAHOUT-1680: Renamed mahout-distribution to apache-mahout-distribution 
(sslavic)
-
-Release 0.10.0 - 2015-04-11
-
-  MAHOUT-1630: Incorrect SparseColumnMatrix.numSlices() causes IndexException 
in toString() (Oleg Nitz, smarthi)
-
-  MAHOUT-1665: Update hadoop commands in example scripts (akm)
-
-  MAHOUT-1676: Deprecate MLP, ConcatenateVectorsJob and 
ConcatenateVectorsReducer in the codebase (apalumbo)
-
-  MAHOUT-1622: MultithreadedBatchItemSimilarities outputs incorrect number of 
similarities (Jesse Daniels, Anand Avati via smarthi)
-
-  MAHOUT-1605: Make VisualizerTest locale independent (Frank Rosner, Anand 
Avati via smarthi)
-
-  MAHOUT-1635: Getting an exception when I provide classification labels 
manually for Naive Bayes (apalumbo)
-
-  MAHOUT-1662: Potential Path bug in SequenceFileVaultIterator breaks 
DisplaySpectralKMeans (Shannon Quinn)
-
-  MAHOUT-1656: Change SNAPSHOT version from 1.0 to 0.10.0 (smarthi)
-
-  MAHOUT-1593: cluster-reuters.sh does not work complaining 
java.lang.IllegalStateException (smarthi via akm)
-
-  MAHOUT-1661: All Lanczos modules marked as @Deprecated and slated for 
removal in future releases (Shannon Quinn)
-
-  MAHOUT-1638: H2O bindings fail at drmParallelizeWithRowLabels(...) (Anand 
Avati via apalumbo)
-
-  MAHOUT-1667: Hadoop 1 and 2 profile in POM (sslavic)
-
-  MAHOUT-1564: Naive Bayes Classifier for New Text Documents (apalumbo)
-
-  MAHOUT-1524: Script to auto-generate and view the Mahout website on a local 
machine (Saleem Ansari via apalumbo)
-
-  MAHOUT-1589: Deprecate mahout.cmd due to lack of support
-
-  MAHOUT-1655: Refactors mr-legacy into mahout-hdfs and mahout-mr, Spark now 
depends on much reduced mahout-hdfs
-
-  MAHOUT-1522: Handle logging levels via log4j.xml (akm)
-
-  MAHOUT-1602: Euclidean Distance Similarity Math (Leonardo Fernandez Sanchez, 
smarthi)
-
-  MAHOUT-1619: HighDFWordsPruner overwrites cache files (Burke Webster, 
smarthi)
-
-  MAHOUT-1516: classify-20newsgroups.sh failed: 
/tmp/mahout-work-jpan/20news-all does not exists in hdfs. (Jian Pan via 
apalumbo)
-
-  MAHOUT-1559: Add documentation for and clean up the wikipedia classifier 
example (apalumbo)
-
-  MAHOUT-1598: extend seq2sparse to handle multiple text blocks of same 
document (Wolfgang Buchnere via akm)
-
-  MAHOUT-1659: Remove deprecated Lanczos solver from spectral clustering in 
mr-legacy (Shannon Quinn)
-
-  MAHOUT-1612: NullPointerException happens during JSON output format for 
clusterdumper (smarthi, Manoj Awasthi)
-
-  MAHOUT-1652: Java 7 update (smarthi)
-
-  MAHOUT-1639: Streaming kmeans doesn't properly validate 
estimatedNumMapClusters -km (smarthi)
-
-  MAHOUT-1493: Port Naive Bayes to Scala DSL (apalumbo)
-
-  MAHOUT-1611: Preconditions.checkArgument in 
org.apache.mahout.utils.ConcatenateVectorsJob (Haishou Ma via smarthi)
-
-  MAHOUT-1615: SparkEngine drmFromHDFS returning the same Key for all Key,Vec 
Pairs for Text-Keyed SequenceFiles (Anand Avati, dlyubimov, apalumbo)
-
-  MAHOUT-1610: Update tests to pass in Java 8 (srowen)
-
-  MAHOUT-1608: Add option in WikipediaToSequenceFile to remove category labels 
from documents (apalumbo)
-
-  MAHOUT-1604: Spark version of rowsimilarity driver and associated additions 
to SimilarityAnalysis.scala (pferrel)
-
-  MAHOUT-1500: H2O Integration (Anand Avati via apalumbo)
-
-  MAHOUT-1606 - Add rowSums, rowMeans and diagonal extraction operations to 
distributed matrices (dlyubimov)
-
-  MAHOUT-1603: Tweaks for Spark 1.0.x (dlyubimov & pferrel)
-
-  MAHOUT-1596: implement rbind() operator (Anand Avati and dlyubimov)
-
-  MAHOUT-1597: A + 1.0 (element-wise scala operation) gives wrong result if 
rdd is missing rows, Spark side (dlyubimov)
-
-  MAHOUT-1595: MatrixVectorView - implement a proper iterateNonZero() (Anand 
Avati via dlyubimov)
-
-  MAHOUT-1590 Mahout unit test failures due to guava version conflict on 
hadoop 2 (Venkat Ranganathan via sslavic)
-
-  MAHOUT-1529(e): Move dense/sparse matrix test in mapBlock into spark (Anand 
Avati via dlyubimov)
-
-  MAHOUT-1583: cbind() operator for Scala DRMs (dlyubimov)
-
-  MAHOUT-1563: Eliminated warnings about multiple scala versions (sslavic)
-
-  MAHOUT-1541, MAHOUT-1568, MAHOUT-1569: Created text-delimited file I/O 
traits and classes on spark, a MahoutDriver for a CLI and a 
ItemSimilairtyDriver using the CLI
-
-  MAHOUT-1573: More explicit parallelism adjustments in math-scala DRM apis; 
elements of automatic parallelism management (dlyubimov)
-
-  MAHOUT-1580: Optimize getNumNonZeroElements() (ssc)
-
-  MAHOUT-1464: Cooccurrence Analysis on Spark (pat)
-
-  MAHOUT-1578: Optimizations in matrix serialization (ssc)
-
-  MAHOUT-1572: blockify() to detect (naively) the data sparsity in the loaded 
data (dlyubimov)
-
-  MAHOUT-1571: Functional Views are not serialized as dense/sparse correctly 
(dlyubimov)
-
-  MAHOUT-1566: (Experimental) Regular ALS factorizer with conversion tests, 
optimizer enhancements and bug fixes (dlyubimov)
-
-  MAHOUT-1537: Minor fixes to spark-shell (Anand Avati via dlyubimov)
-
-  MAHOUT-1529: Finalize abstraction of distributed logical plans from backend 
operations (dlyubimov)
-
-  MAHOUT-1489: Interactive Scala & Spark Bindings Shell & Script processor 
(dlyubimov)
-
-  MAHOUT-1346: Spark Bindings (DRM) (dlyubimov)
-
-  MAHOUT-1555: Exception thrown when a test example has the label not present 
in training examples (Karol Grzegorczyk via smarthi)
-
-  MAHOUT-1446: Create an intro for matrix factorization (Jian Wang via ssc)
-
-  MAHOUT-1480: Clean up website on 20 newsgroups (Andrew Palumbo via ssc)
-
-  MAHOUT-1561: cluster-syntheticcontrol.sh not running locally with 
MAHOUT_LOCAL=true (Andrew Palumbo via ssc)
-
-  MAHOUT-1558: Clean up classify-wiki.sh and add in a binary classification 
problem (Andrew Palumbo via ssc)
-
-  MAHOUT-1560: Last batch is not filled correctly in 
MultithreadedBatchItemSimilarities (Jarosław Bojar)
-
-  MAHOUT-1554: Provide more comprehensive classification statistics (Karol 
Grzegorczyk via ssc)
-
-  MAHOUT-1548: Fix broken links in quickstart webpage (Andrew Palumbo via ssc)
-
-  MAHOUT-1542: Tutorial for playing with Mahout's Spark shell (ssc)
-
-  MAHOUT-1533: Remove Frequent Pattern Mining (ssc)
-
-  MAHOUT-1532: Add solve() function to the Scala DSL (ssc)
-
-  MAHOUT-1530: Custom prompt and welcome message for the Spark Shell (ssc)
-
-  MAHOUT-1527: Fix wikipedia classifier example (Andrew Palumbo via ssc)
-
-  MAHOUT-1526: Ant file in examples (ssc)
-
-  MAHOUT-1523: Remove @author tags in sparkbindings (ssc)
-
-  MAHOUT-1521: lucene2seq - Error trying to load data from stored field (when 
non-indexed) (Terry Blankers via frankscholten)
-
-  MAHOUT-1520: Fix links in Mahout website documentation (Saleem Ansari via 
smarthi)
-
-  MAHOUT-1519: Remove StandardThetaTrainer (Andrew Palumbo via ssc)
-
-  MAHOUT-1517: Remove casts to int in ALSWRFactorizer (ssc)
-
-  MAHOUT-1513: Deprecate Canopy Clustering (ssc)
-
-  MAHOUT-1511: Renaming core to mrlegacy (frankscholten)
-
-  MAHOUT-1510: Goodbye MapReduce (ssc)
-
-  MAHOUT-1509: Invalid URL in link from "quick start/basics" page (Nick 
Martin, smarthi)
-
-  MAHOUT-1508: Performance problems with sparse matrices (ssc)
-
-  MAHOUT-1505: structure of clusterdump's JSON output (akm)
-
-  MAHOUT-1504: Enable/fix thetaSummer job in TrainNaiveBayesJob (Andrew 
Palumbo, smarthi)
-
-  MAHOUT-1503: TestNaiveBayesDriver fails in sequential mode (Andrew Palumbo, 
smarthi)
-
-  MAHOUT-1502: Update Naive Bayes Webpage to Current Implementation (Andrew 
Palumbo via ssc)
-
-  MAHOUT-1501: ClusterOutputPostProcessorDriver has private default 
constructor (ssc)
-
-  MAHOUT-1498: DistributedCache.setCacheFiles in DictionaryVectorizer 
overwrites jars pushed using oozie (Sergey via ssc)
-
-  MAHOUT-1497: mahout resplit not producing splited files (ssc)
-
-  MAHOUT-1496: Create a website describing the distributed ALS recommender 
(Jian Wang via ssc)
-
-  MAHOUT-1491: Spectral KMeans Clustering doesn't clean its /tmp dir and fails 
when seeing it again (smarthi)
-
-  MAHOUT-1488: DisplaySpectralKMeans fails: 
examples/output/clusteredPoints/part-m-00000 does not exist (Saleem Ansari via 
smarthi)
-
-  MAHOUT-1483: Organize links in web site navigation bar (akm)
-
-  MAHOUT-1482: Rework quickstart website (Jian Wang via ssc)
-
-  MAHOUT-1476: Cleanup website on Hidden Markov Models (akm)
-
-  MAHOUT-1475: Cleanup website on Naive Bayes (smarthi)
-
-  MAHOUT-1472: Cleanup website on fuzzy kmeans (smarthi)
-
-  MAHOUT-1471: Cleanup website for Canopy clustering (smarthi)
-
-  MAHOUT-1468: Creating a new page for StreamingKMeans documentation on mahout 
website (Maxim Arap and Pavan Kumar via akm)
-
-  MAHOUT-1467: ClusterClassifier readPolicy leaks file handles (Avi Shinnar, 
smarthi)
-
-  MAHOUT-1466: Cluster visualization fails to execute (ssc)
-
-  MAHOUT-1465: Clean up README (akm)
-
-  MAHOUT-1463: Modify OnlineSummarizers to use the TDigest dependency from 
Maven Central (tdunning, smarthi)
-
-  MAHOUT-1460: Remove reference to Dirichlet in ClusterIterator (frankscholten)
-
-  MAHOUT-1459: Move Hadoop related code out of CanopyClusterer (frankscholten)
-
-  MAHOUT-1458: Remove KMeansConfigKeys and FuzzyKMeansConfigKeys 
(frankscholten)
-
-  MAHOUT-1457: Move EigenSeedGenerator into spectral kmeans package 
(frankscholten)
-
-  MAHOUT-1455: Forkcount config causes JVM crashes during build (frankscholten)
-
-  MAHOUT-1451: Cleaning up the examples for clustering on the website (Gaurav 
Misra via ssc)
-
-  MAHOUT-1450: Cleaning up clustering documentation on mahout website (Pavan 
Kumar)
-
-  MAHOUT-1449: Update the Known Issues in Random Forests Page (Manoj Awasthi 
via ssc)
-
-  MAHOUT-1448: In Random Forest, the training does not support multiple input 
files. The input dataset must be one single file. (Manoj Awasthi via ssc)
-
-  MAHOUT-1447: ImplicitFeedbackAlternatingLeastSquaresSolver tests and 
features (Adam Ilardi via ssc)
-
-  MAHOUT-1445: Create an intro for item based recommender (Nick Martin via ssc)
-
-  MAHOUT-1440: Add option to set the RNG seed for inital cluster generation in 
Kmeans/fKmeans (Andrew Palumbo via ssc)
-
-  MAHOUT-1438: "quickstart" tutorial for building a simple recommender (Maciej 
Mazur and Steve Cook via ssc)
-
-  MAHOUT-1434: Dead links on the web ste (Kevin Moulart, smarthi)
-
-  MAHOUT-1433: Make SVDRecommender look at all unknown items of a user per 
default (ssc)
-
-  MAHOUT-1429: Parallelize YtransposeY in 
ImplicitFeedbackAlternatingLeastSquaresSolver (Adam Ilardi via ssc)
-
-  MAHOUT-1428: Recommending already consumed items (Dodi Hakim via ssc)
-
-  MAHOUT-1425: SGD classifier example with bank marketing dataset. 
(frankscholten)
-
-  MAHOUT-1420: Add solr-recommender to examples (Pat Ferrel via akm)
-
-  MAHOUT-1419: Random decision forest is excessively slow on numeric features 
(srowen)
-
-  MAHOUT-1417: Random decision forest implementation fails in Hadoop 2 (srowen)
-
-  MAHOUT-1416: Make access of DecisionForest.read(dataInput) less restricted 
(Manoj Awasthi via smarthi)
-
-  MAHOUT-1415: Clone method on sparse matrices fails if there is an empty row 
which has not been set explicitly (till.rohrmann via ssc)
-
-  MAHOUT-1413: Rework Algorithms page (ssc)
-
-  MAHOUT-1388: Add command line support and logging for MLP (Yexi Jiang via 
ssc)
-
-  MAHOUT-1385: Caching Encoders don't cache (Johannes Schulte, Manoj Awasthi 
via ssc)
-
-  MAHOUT-1356: Ensure unit tests fail fast when writing outside mvn target 
directory (isabel, smarthi, dweiss, frankscholten, akm)
-
-  MAHOUT-1329: Mahout for hadoop 2 (gcapan, Sergey Svinarchuk)
-
-  MAHOUT-1310: Mahout support windows (Sergey Svinarchuk via ssc)
-
-  MAHOUT-1278: Upgraded to apache parent pom version 16 (sslavic)
-
-Release 0.9 - 2014-02-01
-
-  MAHOUT-1387: Create page for release notes (ssc)
-
-  MAHOUT-1411: Random test failures from TDigestTest (smarthi)
-
-  MAHOUT-1410: clusteredPoints do not contain a vector id (smarthi, Andrew 
Musselman)
-
-  MAHOUT-1409: MatrixVectorView has index check error (tdunning)
-
-  MAHOUT-1402: Zero clusters using streaming k-means option in 
cluster-reuters.sh (smarthi)
-
-  MAHOUT-1401: Resurrect Frequent Pattern mining (smarthi)
-
-  MAHOUT-1400: Remove references to deprecated and removed algorithms from 
examples scripts (ssc)
-
-  MAHOUT-1399: Fixed multiple slf4j bindings when running Mahout examples 
issue (sslavic)
-
-  MAHOUT-1398: FileDataModel should provide a constructor with a 
delimiterPattern (Roy Guo via ssc)
-
-  MAHOUT-1396: Accidental use of commons-math won't work with next Hadoop 2 
release (srowen)
-
-  MAHOUT-1394: Undeprecate Lanczos (ssc)
-
-  MAHOUT-1393: Remove duplicated code from getTopTerms and getTopFeatures in 
AbstractClusterWriter (Diego Carrion via smarthi)
-
-  MAHOUT-1392: Streaming KMeans should write centroid output to a 
'part-r-xxxx' file when executed in sequential mode (smarthi)
-
-  MAHOUT-1390: SVD hangs for certain inputs (tdunning)
-
-  MAHOUT-1389: Complementary Naive Bayes Classifier not getting called when 
"-c" option is activated (Gouri Shankar Majumdar via smarthi)
-
-  MAHOUT-1384: Executing the MR version of Naive Bayes/CNB of 
classify_20newgroups.sh fails in seqdirectory step (smarthi)
-
-  MAHOUT-1382: Upgrade Mahout third party jars for 0.9 Release (smarthi)
-
-  MAHOUT-1380: Streaming KMeans fails when executed in Sequential Mode 
(smarthi)
-
-  MAHOUT-1379: ClusterQualitySummarizer fails with the new T-Digest for 
clusters with 1 data point (smarthi)
-
-  MAHOUT-1378: Running Random Forest with Ignored features fails when loading 
feature descriptor from JSON file (Sam Wu via smarthi)
-
-  MAHOUT-1377: Exclude JUnit.jar from tarball (Sergey Svinarchuk via smarthi)
-
-  MAHOUT-1374: Ability to provide input file with userid, itemid pair 
(Aliaksei Litouka via ssc)
-
-  MAHOUT-1371: Arff loader can misinterpret nominals with integer, real or 
string (Mansur Iqbal via smarthi)
-
-  MAHOUT-1370: Vectordump doesn't write to output file in MapReduce Mode 
(smarthi)
-
-  MAHOUT-1368: Convert OnlineSummarizer to use the new TDigest (tdunning)
-
-  MAHOUT-1367: WikipediaXmlSplitter --> Exception in thread "main" 
java.lang.NullPointerException (smarthi)
-
-  MAHOUT-1364: Upgrade Mahout codebase to Lucene 4.6 (Frank Scholten)
-
-  MAHOUT-1363: Rebase packages in mahout-scala (dlyubimov)
-
-  MAHOUT-1362: Remove examples/bin/build-reuters.sh (smarthi)
-
-  MAHOUT-1361: Online algorithm for computing accurate Quantiles using 1-D 
clustering (tdunning)
-
-  MAHOUT-1358: StreamingKMeansThread throws IllegalArgumentException when 
REDUCE_STREAMING_KMEANS is set to true (smarthi)
-
-  MAHOUT-1355: InteractionValueEncoder produces wrong traceDictionary entries 
(Johannes Schulte via smarthi)
-
-  MAHOUT-1353: Visibility of preparePreferenceMatrix directory location (Pat 
Ferrel, ssc)
-
-  MAHOUT-1352: Option to change RecommenderJob output format (Pat Ferrel, ssc)
-
-  MAHOUT-1351: Adding DenseVector support to AbstractCluster (David DeBarr via 
smarthi)
-
-  MAHOUT-1349: Clusterdumper/loadTermDictionary crashes when highest index in 
(sparse) dictionary vector is larger than dictionary vector size (Andrew 
Musselman via smarthi)
-
-  MAHOUT-1347: Add Streaming K-Means clustering algorithm to 
examples/bin/cluster-reuters.sh (smarthi)
-
-  MAHOUT-1345: Enable randomised testing for all Mahout modules (Dawid Weiss, 
Isabel, sslavic, Frank Scholten, smarthi)
-
-  MAHOUT-1343: JSON output format support in cluster dumper (Telvis Calhoun 
via sslavic)
-
-  MAHOUT-1333: Fixed examples bin directory permissions in distribution 
archives (Mike Percy via sslavic)
-
-  MAHOUT-1319: seqdirectory -filter argument silently ignored when run as MR 
(smarthi)
-
-  MAHOUT-1317: Clarify some of the messages in Preconditions.checkArgument 
(Nikolai Grinko, smarthi)
-
-  MAHOUT-1314: StreamingKMeansReducer throws NullPointerException when 
REDUCE_STREAMING_KMEANS is set to true (smarthi)
-
-  MAHOUT-1313: Fixed unwanted integral division bug in RowSimilarityJob 
downsampling code where precision should have been retained (sslavic)
-
-  MAHOUT-1312: LocalitySensitiveHashSearch does not limit search results 
(sslavic)
-
-  MAHOUT-1308: Cannot extend CandidateItemsStrategy due to restricted 
visibility (David Geiger, smarthi)
-
-  MAHOUT-1301: toString() method of SequentialAccessSparseVector has excess 
comma at the end (Alexander Senov, smarthi)
-
-  MAHOUT-1297: New module for linear algebra scala DSL (dlyubimov)
-
-  MAHOUT-1296: Remove deprecated algorithms (ssc)
-
-  MAHOUT-1295: Excluded all Maven's target directories from distribution 
archives (sslavic)
-
-  MAHOUT-1294: Cleanup previously installed artifacts from CI server local 
repository (sslavic)
-
-  MAHOUT-1293: Source distribution tar.gz archive cannot be unpacked on Linux 
(sslavic)
-
-  MAHOUT-1292: lucene2seq should validate the 'id' field (Frank Scholten via 
smarthi)
-
-  MAHOUT-1291: MahoutDriver yields cosmetically suboptimal exception when 
bin/mahout runs without args, on some Hadoop versions (srowen)
-
-  MAHOUT-1290: Issue when running Mahout Recommender Demo (Helder Garay 
Martins via smarthi)
-
-  MAHOUT-1289: Move downsampling code into RowSimilarityJob (ssc)
-
-  MAHOUT-1287: classifier.sgd.CsvRecordFactory incorrectly parses CSV format 
(Alex Franchuk via smarthi)
-
-  MAHOUT-1285: Arff loader can misparse string data as double (smarthi)
-
-  MAHOUT-1284: DummyRecordWriter's bug with reused Writables (Maysam Yabandeh 
via smarthi)
-
-  MAHOUT-1275: Dropped bz2 distribution format for source and binaries 
(sslavic)
-
-  MAHOUT-1265: Multilayer Perceptron (Yexi Jiang via smarthi)
-
-  MAHOUT-1261: TasteHadoopUtils.idToIndex can return an int that has size 
Integer.MAX_VALUE (Carl Clark, smarthi)
-
-  MAHOUT-1242: No key redistribution function for associative maps (Tharindu 
Rusira via smarthi)
-
-  MAHOUT-1030: Regression: Clustered Points Should be 
WeightedPropertyVectorWritable not WeightedVectorWritable (Andrew Musselman, 
Pat Ferrel, Jeff Eastman, Lars Norskog, smarthi)
-
-Release 0.8 - 2013-07-25
-
-  MAHOUT-1272: Parallel SGD matrix factorizer for SVDrecommender (Peng Cheng 
via ssc)
-
-  MAHOUT-1271: classify-20newsgroups.sh fails during the seqdirectory step 
(smarthi)
-
-  MAHOUT-1269: Cleanup deprecated Lucene 3.x API calls in lucene2seq utility 
unit tests (smarthi)
-
-  MAHOUT-833: Make conversion to sequence files map-reduce (Josh Patterson, 
smarthi)
-
-  MAHOUT-1268: Wrong output directory for CVB (Mark Wicks via ssc)
-
-  MAHOUT-1264: Performance optimizations in RecommenderJob (ssc)
-
-  MAHOUT-1262: Cleanup LDA code (ssc)
-
-  MAHOUT-1255: Fix for weights in Multinomial sometimes overflowing in 
BallKMeans (dfilimon)
-
-  MAHOUT-1254: Final round of cleanup for StreamingKMeans (dfilimon)
-
-  MAHOUT-1263: Serialise/Deserialise Lambda value for OnlineLogisticRegression 
(Mike Davy via smarthi)
-
-  MAHOUT-1258: Another shot at findbugs and checkstyle (ssc)
-
-  MAHOUT-1253: Add experiment tools for StreamingKMeans, part 1 (dfilimon)
-
-  MAHOUT-884:  Matrix Concatenate Utility (Lance Norskog via smarthi)
-
-  MAHOUT-1250: Deprecate unused algorithms (ssc)
-
-  MAHOUT-1251: Optimize MinHashMapper (ssc)
-
-  MAHOUT-1211: Disabled swallowing of IOExceptions is Closeables.close for 
writers (dfilimon)
-
-  MAHOUT-1164: Make ARFF integration generate meta-data in JSON format (Marty 
Kube via ssc)
-
-  MAHOUT-1164: Make ARFF integration generate meta-data in JSON format (Marty 
Kube via ssc)
-
-  MAHOUT-1163: Make random forest classifier meta-data file human readable 
(Marty Kube via ssc)
-
-  MAHOUT-1243: Dictionary file format in Lucene-Mahout integration is not in 
SequenceFileFormat (ssc)
-
-  MAHOUT-974:  
org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer 
as userId and itemId (ssc)
-
-  MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of 
vector to hash (indexes or values) (Elena Smirnova via smarthi)
-
-  MAHOUT-1237: Total cluster cost isn't computed properly (dfilimon)
-
-  MAHOUT-1196: LogisticModelParameters uses csv.getTargetCategories() even if 
csv is not used. (Vineet Krishnan via ssc)
-
-  MAHOUT-1224: Add the option of running a StreamingKMeans pass in the Reducer 
before BallKMeans (dfilimon)
-
-  MAHOUT-993:  Some vector dumper flags are expecting arguments. (Andrew Look 
via robinanil)
-
-  MAHOUT-1228: Cleanup .gitignore (Stevo Slavic via ssc)
-
-  MAHOUT-1047: CVB hangs after completion (Angel Martinez Gonzalez via smarthi)
-
-  MAHOUT-1235: ParallelALSFactorizationJob does not use VectorSumCombiner (ssc)
-
-  MAHOUT-1230: SparceMatrix.clone() is not deep copy (Maysam Yabandeh via 
tdunning)
-
-  MAHOUT-1232: VectorHelper.topEntries() throws a NPE when number of NonZero 
elements in vector < maxEntries (smarthi)
-
-  MAHOUT-1229: Conf directory content from Mahout distribution archives cannot 
be unpacked (Stevo Slavic via smarthi)
-
-  MAHOUT-1213: SSVD job doesn't clean it's temp dir, and fails when seeing it 
again (smarthi)
-
-  MAHOUT-1223: Fixed point skipped in StreamingKMeans when iterating through 
centroids from a reducer (dfilimon)
-
-  MAHOUT-1222: Fix total weight in FastProjectionSearch (dfilimon)
-
-  MAHOUT-1219: Remove LSHSearcher from StreamingKMeansTest. It causes it to 
sometimes fail (dfilimon)
-
-  MAHOUT-1221: SparseMatrix.viewRow is sometimes readonly. (Maysam Yabandeh 
via smarthi)
-
-  MAHOUT-1219: Remove LSHSearcher from SearchQualityTest. It causes it to 
fail, but the failure is not very meaningful (dfilimon)
-
-  MAHOUT-1217: Nearest neighbor searchers sometimes fail to remove points: fix 
in FastProjectionSearch's searchFirst (dfilimon)
-
-  MAHOUT-1216: Add locality sensitive hashing and a LocalitySensitiveHash 
searcher (dfilimon)
-
-  MAHOUT-1181: Adding StreamingKMeans MapReduce classes (dfilimon)
-
-  MAHOUT-1212: Incorrect classify-20newsgroups.sh file description (Julian 
Ortega via smarthi)
-
-  MAHOUT-1209: DRY out maven-compiler-plugin configuration (Stevo Slavic via 
smarthi)
-
-  MAHOUT-1207: Fix typos in description in parent pom (Stevo Slavic via 
smarthi)
-
-  MAHOUT-1199: Improve javadoc comments of mahout-integration (Angel Martinez 
Gonzalez via smarthi)
-
-  MAHOUT-1162: Adding BallKMeans and StreamingKMeans clustering algorithms 
(dfilimon)
-
-  MAHOUT-1205: ParallelALSFactorizationJob should leverage the distributed 
cache (ssc)
-
-  MAHOUT-1156: Adding nearest neighbor Searchers (dfilimon)
-
-  MAHOUT-1202: Speed up Vector operations (dfilimon)
-
-  MAHOUT-1155: Make MatrixSlice a Vector (and fix Centroid cloning; 
MAHOUT-1202) (dfilimon)
-
-  MAHOUT-1189: CosineDistanceMeasure doesn't return 0 for two 0 vectors 
(dfilimon)
-
-  MAHOUT-1180: Multinomial<T> throws ConcurrentModificationException when 
iterating and setting probabilities (dfilimon)
-
-  MAHOUT-1192: Speed up Vector Operations (robinanil)
-
-  MAHOUT-1191: Cleanup Vector Benchmarks make it less variable (robinanil)
-
-  MAHOUT-1190: SequentialAccessSparseVector function assignment is very slow 
and other iterator woes (robinanil)
-
-  MAHOUT-1188: Inconsistent reference to Lucene versions in code and POM 
(smarthi)
-
-  MAHOUT-1161: Unable to run CJKAnalyzer for conversion of a sequence file to 
sparse vector due to instantiation exception (ssc)
-
-  MAHOUT-1187: Update Commons Lang to Commons Lang3 (smarthi)
-
-  MAHOUT-1184 Another take at pmd, findbugs and checkstyle (ssc)
-
-  MAHOUT-1182: Remove useless append (Dave Brosius via tdunning)
-
-  MAHOUT-1176: Introduce a changelog file to raise contributors attribution 
(ssc)
-
-  MAHOUT-1108: Allows cluster-reuters.sh example to be executed on a cluster 
(elmer.garduno via gsingers)
-
-  MAHOUT-961: Fix issue in decision forest tree visualizer to properly show 
stems of tree (Ikumasa Mukai via gsingers)
-
-  MAHOUT-944: Create SequenceFiles out of Lucene document storage (no term 
vectors required) (Frank Scholten, gsingers)
-
-  MAHOUT-958: Fix issue with globs in RepresentativePointsDriver (Adam Baron, 
Vikram Dixit K, ehgjr via gsingers)
-
-  MAHOUT-1084: Fixed issue with too many clusters in synthetic control example 
(liutengfei, gsingers)
-
-  MAHOUT-1103: Fixed issue with splitting clusters on Hadoop (Matt Molek, 
gsingers)
-
-  MAHOUT-1126: Filter out bad META-INF files in job packaging (Pat Ferrel, 
gsingers)
-
-  MAHOUT-1211: Change deprecated Closeables.closeQuietly calls (smarthi, 
gsingers, srowen, dlyubimov)

http://git-wip-us.apache.org/repos/asf/mahout/blob/b5fe4aab/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
----------------------------------------------------------------------
diff --git a/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala b/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
index 4632468..a10b942 100644
--- a/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
+++ b/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
@@ -44,23 +44,34 @@ object SimilarityAnalysis extends Serializable {
   /** Compares (Int,Double) pairs by the second value */
   private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case ((_, score1), (_, score2)) => score1 > score2}
 
+  lazy val defaultParOpts = ParOpts()
+
   /**
    * Calculates item (column-wise) similarity using the log-likelihood ratio 
on A'A, A'B, A'C, ...
    * and returns a list of similarity and cross-similarity matrices
-   * @param drmARaw Primary interaction matrix
+    *
+    * @param drmARaw Primary interaction matrix
    * @param randomSeed when kept to a constant will make repeatable 
downsampling
    * @param maxInterestingItemsPerThing number of similar items to return per 
item, default: 50
    * @param maxNumInteractions max number of interactions after downsampling, 
default: 500
+   * @param parOpts partitioning params for drm.par(...)
    * @return a list of [[org.apache.mahout.math.drm.DrmLike]] containing 
downsampled DRMs for cooccurrence and
    *         cross-cooccurrence
    */
-  def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, maxInterestingItemsPerThing: Int = 50,
-                    maxNumInteractions: Int = 500, drmBs: Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = {
+  def cooccurrences(
+    drmARaw: DrmLike[Int],
+    randomSeed: Int = 0xdeadbeef,
+    maxInterestingItemsPerThing: Int = 50,
+    maxNumInteractions: Int = 500,
+    drmBs: Array[DrmLike[Int]] = Array(),
+    parOpts: ParOpts = defaultParOpts)
+    : List[DrmLike[Int]] = {
 
     implicit val distributedContext = drmARaw.context
 
-    // backend allowed to optimize partitioning
-    drmARaw.par(auto = true)
+    // backend partitioning defaults to 'auto', which is often better decided by calling function
+    // todo:  this should ideally be different per drm
+    drmARaw.par( min = parOpts.minPar, exact = parOpts.exactPar, auto = parOpts.autoPar)
 
     // Apply selective downsampling, pin resulting matrix
     val drmA = sampleDownAndBinarize(drmARaw, randomSeed, maxNumInteractions)
@@ -82,8 +93,9 @@ object SimilarityAnalysis extends Serializable {
 
     // Now look at cross cooccurrences
     for (drmBRaw <- drmBs) {
-      // backend allowed to optimize partitioning
-      drmBRaw.par(auto = true)
+      // backend partitioning defaults to 'auto', which is often better decided by calling function
+      // todo:  this should ideally be different per drm
+      drmBRaw.par( min = parOpts.minPar, exact = parOpts.exactPar, auto = parOpts.autoPar)
 
       // Down-sample and pin other interaction matrix
       val drmB = sampleDownAndBinarize(drmBRaw, randomSeed, maxNumInteractions).checkpoint()
@@ -100,21 +112,11 @@ object SimilarityAnalysis extends Serializable {
       similarityMatrices = similarityMatrices :+ drmSimilarityAtB
 
       drmB.uncache()
-
-      //debug
-      val atbRows = drmSimilarityAtB.nrow
-      val atbCols = drmSimilarityAtB.ncol
-      val i = 0
     }
 
     // Unpin downsampled interaction matrix
     drmA.uncache()
 
-    //debug
-    val ataRows = drmSimilarityAtA.nrow
-    val ataCols = drmSimilarityAtA.ncol
-    val i = 0
-
     // Return list of similarity matrices
     similarityMatrices
   }
@@ -123,23 +125,27 @@ object SimilarityAnalysis extends Serializable {
    * Calculates item (column-wise) similarity using the log-likelihood ratio on A'A, A'B, A'C, ... and returns
    * a list of similarity and cross-similarity matrices. Somewhat easier to use method, which handles the ID
    * dictionaries correctly
+   *
    * @param indexedDatasets first in array is primary/A matrix all others are treated as secondary
    * @param randomSeed use default to make repeatable, otherwise pass in system time or some randomizing seed
    * @param maxInterestingItemsPerThing max similarities per items
    * @param maxNumInteractions max number of input items per item
+   * @param parOpts partitioning params for drm.par(...)
    * @return a list of [[org.apache.mahout.math.indexeddataset.IndexedDataset]] containing downsampled
    *         IndexedDatasets for cooccurrence and cross-cooccurrence
    */
-  def cooccurrencesIDSs(indexedDatasets: Array[IndexedDataset],
-      randomSeed: Int = 0xdeadbeef,
-      maxInterestingItemsPerThing: Int = 50,
-      maxNumInteractions: Int = 500):
+  def cooccurrencesIDSs(
+    indexedDatasets: Array[IndexedDataset],
+    randomSeed: Int = 0xdeadbeef,
+    maxInterestingItemsPerThing: Int = 50,
+    maxNumInteractions: Int = 500,
+    parOpts: ParOpts = defaultParOpts):
     List[IndexedDataset] = {
     val drms = indexedDatasets.map(_.matrix.asInstanceOf[DrmLike[Int]])
     val primaryDrm = drms(0)
     val secondaryDrms = drms.drop(1)
     val coocMatrices = cooccurrences(primaryDrm, randomSeed, maxInterestingItemsPerThing,
-      maxNumInteractions, secondaryDrms)
+      maxNumInteractions, secondaryDrms, parOpts)
     val retIDSs = coocMatrices.iterator.zipWithIndex.map {
       case( drm, i ) =>
         indexedDatasets(0).create(drm, indexedDatasets(0).columnIDs, indexedDatasets(i).columnIDs)
@@ -148,19 +154,110 @@ object SimilarityAnalysis extends Serializable {
   }
 
   /**
+    * Calculates item (column-wise) similarity using the log-likelihood ratio on A'A, A'B, A'C, ... and returns
+    * a list of similarity and cross-occurrence matrices. Somewhat easier to use method, which handles the ID
+    * dictionaries correctly and contains info about downsampling in each model calc.
+    *
+    * @param datasets first in array is primary/A matrix all others are treated as secondary, includes information
+    *                 used to downsample the input drm as well as the output llr(A'A), llr(A'B). The information
+    *                 is contained in each dataset in the array and applies to the model calculation of A' with
+    *                 the dataset. Todo: ignoring absolute threshold for now.
+    * @param randomSeed use default to make repeatable, otherwise pass in system time or some randomizing seed
+    * @param parOpts partitioning params for drm.par(...)
+    * @return a list of [[org.apache.mahout.math.indexeddataset.IndexedDataset]] containing downsampled
+    *         IndexedDatasets for cooccurrence and cross-cooccurrence
+    */
+  def crossOccurrenceDownsampled(
+    datasets: List[DownsamplableCrossOccurrenceDataset],
+    randomSeed: Int = 0xdeadbeef):
+    List[IndexedDataset] = {
+
+
+    val crossDatasets = datasets.drop(1) // drop A
+    val primaryDataset = datasets.head // use A throughout
+    val drmARaw = primaryDataset.iD.matrix
+
+    implicit val distributedContext = primaryDataset.iD.matrix.context
+
+    // backend partitioning defaults to 'auto', which is often better decided by calling function
+    val parOptsA = primaryDataset.parOpts.getOrElse(defaultParOpts)
+    drmARaw.par( min = parOptsA.minPar, exact = parOptsA.exactPar, auto = parOptsA.autoPar)
+
+    // Apply selective downsampling, pin resulting matrix
+    val drmA = sampleDownAndBinarize(drmARaw, randomSeed, primaryDataset.maxElementsPerRow)
+
+    // num users, which equals the maximum number of interactions per item
+    val numUsers = drmA.nrow.toInt
+
+    // Compute & broadcast the number of interactions per thing in A
+    val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerColumn)
+
+    // Compute cooccurrence matrix A'A
+    val drmAtA = drmA.t %*% drmA
+
+    // Compute loglikelihood scores and sparsify the resulting matrix to get the similarity matrix
+    val drmSimilarityAtA = computeSimilarities(drmAtA, numUsers, primaryDataset.maxInterestingElements,
+      bcastInteractionsPerItemA, bcastInteractionsPerItemA, crossCooccurrence = false,
+      minLLROpt = primaryDataset.minLLROpt)
+
+    var similarityMatrices = List(drmSimilarityAtA)
+
+    // Now look at cross cooccurrences
+    for (dataset <- crossDatasets) {
+      // backend partitioning defaults to 'auto', which is often better decided by calling function
+      val parOptsB = dataset.parOpts.getOrElse(defaultParOpts)
+      dataset.iD.matrix.par(min = parOptsB.minPar, exact = parOptsB.exactPar, auto = parOptsB.autoPar)
+
+      // Downsample and pin other interaction matrix
+      val drmB = sampleDownAndBinarize(dataset.iD.matrix, randomSeed, dataset.maxElementsPerRow).checkpoint()
+
+      // Compute & broadcast the number of interactions per thing in B
+      val bcastInteractionsPerThingB = drmBroadcast(drmB.numNonZeroElementsPerColumn)
+
+      // Compute cross-cooccurrence matrix A'B
+      val drmAtB = drmA.t %*% drmB
+
+      val drmSimilarityAtB = computeSimilarities(drmAtB, numUsers, dataset.maxInterestingElements,
+        bcastInteractionsPerItemA, bcastInteractionsPerThingB, minLLROpt = dataset.minLLROpt)
+
+      similarityMatrices = similarityMatrices :+ drmSimilarityAtB
+
+      drmB.uncache()
+    }
+
+    // Unpin downsampled interaction matrix
+    drmA.uncache()
+
+    // Return list of datasets
+    val retIDSs = similarityMatrices.iterator.zipWithIndex.map {
+      case( drm, i ) =>
+        datasets(0).iD.create(drm, datasets(0).iD.columnIDs, datasets(i).iD.columnIDs)
+    }
+    retIDSs.toList
+
+  }
+
+  /**
    * Calculates row-wise similarity using the log-likelihood ratio on AA' and returns a DRM of rows and similar rows
+   *
    * @param drmARaw Primary interaction matrix
    * @param randomSeed when kept to a constant will make repeatable downsampling
    * @param maxInterestingSimilaritiesPerRow number of similar items to return per item, default: 50
    * @param maxNumInteractions max number of interactions after downsampling, default: 500
+   * @param parOpts partitioning options used for drm.par(...)
    */
-  def rowSimilarity(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, maxInterestingSimilaritiesPerRow: Int = 50,
-                    maxNumInteractions: Int = 500): DrmLike[Int] = {
+  def rowSimilarity(
+    drmARaw: DrmLike[Int],
+    randomSeed: Int = 0xdeadbeef,
+    maxInterestingSimilaritiesPerRow: Int = 50,
+    maxNumInteractions: Int = 500,
+    parOpts: ParOpts = defaultParOpts): DrmLike[Int] = {
 
     implicit val distributedContext = drmARaw.context
 
-    // backend allowed to optimize partitioning
-    drmARaw.par(auto = true)
+    // backend partitioning defaults to 'auto', which is often better decided by calling function
+    // todo: should this ideally be different per drm?
+    drmARaw.par(min = parOpts.minPar, exact = parOpts.exactPar, auto = parOpts.autoPar)
 
     // Apply selective downsampling, pin resulting matrix
     val drmA = sampleDownAndBinarize(drmARaw, randomSeed, maxNumInteractions)
@@ -184,6 +281,7 @@ object SimilarityAnalysis extends Serializable {
   /**
    * Calculates row-wise similarity using the log-likelihood ratio on AA' and returns a drm of rows and similar rows.
    * Uses IndexedDatasets, which handle external ID dictionaries properly
+   *
    * @param indexedDataset compare each row to every other
    * @param randomSeed  use default to make repeatable, otherwise pass in system time or some randomizing seed
    * @param maxInterestingSimilaritiesPerRow max elements returned in each row
@@ -211,9 +309,17 @@ object SimilarityAnalysis extends Serializable {
 
   }
 
-  def computeSimilarities(drm: DrmLike[Int], numUsers: Int, maxInterestingItemsPerThing: Int,
-                        bcastNumInteractionsB: BCast[Vector], bcastNumInteractionsA: BCast[Vector],
-                        crossCooccurrence: Boolean = true) = {
+  def computeSimilarities(
+    drm: DrmLike[Int],
+    numUsers: Int,
+    maxInterestingItemsPerThing: Int,
+    bcastNumInteractionsB: BCast[Vector],
+    bcastNumInteractionsA: BCast[Vector],
+    crossCooccurrence: Boolean = true,
+    minLLROpt: Option[Double] = None) = {
+
+    val minLLR = minLLROpt.getOrElse(0.0d) // accept all values if not specified
+
     drm.mapBlock() {
       case (keys, block) =>
 
@@ -245,11 +351,13 @@ object SimilarityAnalysis extends Serializable {
               // val candidate = thingA -> normailizedLLR
 
               // Enqueue item with score, if belonging to the top-k
-              if (topItemsPerThing.size < maxInterestingItemsPerThing) {
-                topItemsPerThing.enqueue(candidate)
-              } else if (orderByScore.lt(candidate, topItemsPerThing.head)) {
-                topItemsPerThing.dequeue()
-                topItemsPerThing.enqueue(candidate)
+              if(candidate._2 >= minLLR) { // llr threshold takes precedence over max per row
+                if (topItemsPerThing.size < maxInterestingItemsPerThing) {
+                  topItemsPerThing.enqueue(candidate)
+                } else if (orderByScore.lt(candidate, topItemsPerThing.head)) {
+                  topItemsPerThing.dequeue()
+                  topItemsPerThing.enqueue(candidate)
+                }
               }
             }
           }
@@ -270,6 +378,7 @@ object SimilarityAnalysis extends Serializable {
    * https://github.com/tdunning/in-memory-cooccurrence/blob/master/src/main/java/com/tdunning/cooc/Analyze.java
    *
    * additionally binarizes input matrix, as we're only interested in knowing whether interactions happened or not
+   *
    * @param drmM matrix to downsample
    * @param seed random number generator seed, keep to a constant if repeatability is necessary
    * @param maxNumInteractions number of elements in a row of the returned matrix
@@ -325,3 +434,18 @@ object SimilarityAnalysis extends Serializable {
     downSampledDrmI
   }
 }
+
+case class ParOpts( // this will contain the default `par` params except for auto = true
+  minPar: Int = -1,
+  exactPar: Int = -1,
+  autoPar: Boolean = true)
+
+/* Used to pass in data and params for downsampling the input data as well as output A'A, A'B, etc. */
+case class DownsamplableCrossOccurrenceDataset(
+  iD: IndexedDataset,
+  maxElementsPerRow: Int = 500, // usually items per user in the input dataset, used to randomly downsample
+  maxInterestingElements: Int = 50, // number of items/columns to keep in the A'A, A'B etc. where iD == A, B, C ...
+  minLLROpt: Option[Double] = None, // absolute threshold, takes precedence over maxInterestingElements if present
+  parOpts: Option[ParOpts] = None) // these can be set per dataset and are applied to each of the drms
+                                // in crossOccurrenceDownsampled
+
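
Note on the new minLLROpt threshold: the change to computeSimilarities applies the LLR threshold before the top-k queue, so a candidate scoring below the threshold is never enqueued, even when fewer than maxInterestingItemsPerThing candidates have been kept. The sketch below reproduces only that selection logic on a plain Seq so it can be run without Mahout or Spark. The names TopKWithThreshold and topK are hypothetical and are not part of SimilarityAnalysis; the orderByScore ordering and the enqueue/dequeue steps follow the diff above.

```scala
import scala.collection.mutable

// Hypothetical stand-alone sketch of the selection step in computeSimilarities.
// It works on plain (index, score) pairs instead of DRM blocks.
object TopKWithThreshold {

  // Same idea as orderByScore in the diff: a higher score sorts first,
  // so the PriorityQueue's head is the lowest-scoring candidate retained so far.
  private val orderByScore: Ordering[(Int, Double)] =
    Ordering.fromLessThan[(Int, Double)] { case ((_, s1), (_, s2)) => s1 > s2 }

  def topK(
      candidates: Seq[(Int, Double)],
      k: Int,
      minLLROpt: Option[Double] = None): List[(Int, Double)] = {
    require(k > 0, "k must be positive") // assumption of this sketch

    val minLLR = minLLROpt.getOrElse(0.0d) // accept all values if not specified
    val queue = new mutable.PriorityQueue[(Int, Double)]()(orderByScore)

    for (candidate <- candidates) {
      // The threshold is checked first, so it takes precedence over the per-row maximum
      if (candidate._2 >= minLLR) {
        if (queue.size < k) {
          queue.enqueue(candidate)
        } else if (orderByScore.lt(candidate, queue.head)) {
          // candidate scores higher than the weakest retained item: replace it
          queue.dequeue()
          queue.enqueue(candidate)
        }
      }
    }

    // Return the retained items, highest score first
    queue.toList.sorted(orderByScore)
  }
}
```

With k = 2 and minLLROpt = Some(1.7), a row whose candidates score 1.73, 0.68 and 0.68 keeps only the first one, which is the behaviour the "LLR threshold" test in SimilarityAnalysisSuite exercises on the A'B matrix.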

http://git-wip-us.apache.org/repos/asf/mahout/blob/b5fe4aab/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
----------------------------------------------------------------------
diff --git a/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala b/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
index 0b3b3eb..63e0df7 100644
--- a/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
+++ b/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
@@ -17,9 +17,11 @@
 
 package org.apache.mahout.cf
 
-import org.apache.mahout.math.cf.SimilarityAnalysis
+import org.apache.mahout.math.cf.{DownsamplableCrossOccurrenceDataset, SimilarityAnalysis}
 import org.apache.mahout.math.drm._
+import org.apache.mahout.math.indexeddataset.BiDictionary
 import org.apache.mahout.math.scalabindings.{MatrixOps, _}
+import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
 import org.apache.mahout.sparkbindings.test.DistributedSparkSuite
 import org.apache.mahout.test.MahoutSuite
 import org.scalatest.FunSuite
@@ -58,7 +60,7 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
     (1.7260924347106847, 0.6795961471815897, 0.6795961471815897, 1.7260924347106847, 0.0),
     (0.0,                0.0,                0.0,                0.0,                4.498681156950466))
 
-  final val matrixLLRCoocBtAControl = dense(
+  final val matrixLLRCoocAtBControl = dense(
       (1.7260924347106847, 1.7260924347106847, 1.7260924347106847, 1.7260924347106847, 0.0),
       (0.6795961471815897, 0.6795961471815897, 0.6795961471815897, 0.6795961471815897, 0.0),
       (0.6795961471815897, 0.6795961471815897, 0.6795961471815897, 0.6795961471815897, 0.0),
@@ -66,7 +68,7 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
       (0.0,                0.0,                0.6795961471815897, 0.0,                4.498681156950466))
 
 
-  test("cooccurrence [A'A], [B'A] boolbean data using LLR") {
+  test("Cross-occurrence [A'A], [B'A] boolbean data using LLR") {
     val a = dense(
         (1, 1, 0, 0, 0),
         (0, 0, 1, 1, 0),
@@ -91,13 +93,13 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
 
     //cross similarity
     val matrixCrossCooc = drmCooc(1).checkpoint().collect
-    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtAControl)
+    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBControl)
     n = (new MatrixOps(m = diff2Matrix)).norm
     n should be < 1E-10
 
   }
 
-  test("cooccurrence [A'A], [B'A] double data using LLR") {
+  test("Cross-occurrence [A'A], [B'A] double data using LLR") {
     val a = dense(
         (100000.0D, 1.0D,  0.0D,  0.0D,     0.0D),
         (     0.0D, 0.0D, 10.0D,  1.0D,     0.0D),
@@ -122,12 +124,12 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
 
     //cross similarity
     val matrixCrossCooc = drmCooc(1).checkpoint().collect
-    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtAControl)
+    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBControl)
     n = (new MatrixOps(m = diff2Matrix)).norm
     n should be < 1E-10
   }
 
-  test("cooccurrence [A'A], [B'A] integer data using LLR") {
+  test("Cross-occurrence [A'A], [B'A] integer data using LLR") {
     val a = dense(
         ( 1000,  10,       0,    0,   0),
         (    0,   0,  -10000,   10,   0),
@@ -154,12 +156,12 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
 
     //cross similarity
     val matrixCrossCooc = drmCooc(1).checkpoint().collect
-    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtAControl)
+    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBControl)
     n = (new MatrixOps(m = diff2Matrix)).norm
     n should be < 1E-10
   }
 
-  test("cooccurrence two matrices with different number of columns"){
+  test("Cross-occurrence two matrices with different number of columns"){
     val a = dense(
       (1, 1, 0, 0, 0),
       (0, 0, 1, 1, 0),
@@ -172,7 +174,7 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
       (0, 0, 1, 0),
       (1, 1, 0, 1))
 
-    val matrixLLRCoocBtANonSymmetric = dense(
+    val matrixLLRCoocAtBNonSymmetric = dense(
       (0.0,                1.7260924347106847, 1.7260924347106847, 1.7260924347106847),
       (0.0,                0.6795961471815897, 0.6795961471815897, 0.0),
       (1.7260924347106847, 0.6795961471815897, 0.6795961471815897, 0.0),
@@ -191,7 +193,7 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
 
     //cross similarity
     val matrixCrossCooc = drmCooc(1).checkpoint().collect
-    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtANonSymmetric)
+    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBNonSymmetric)
     n = (new MatrixOps(m = diff2Matrix)).norm
 
     //cooccurrence without LLR is just a A'B
@@ -199,6 +201,107 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
     //val bp = 0
   }
 
+  test("Cross-occurrence two IndexedDatasets"){
+    val a = dense(
+      (1, 1, 0, 0, 0),
+      (0, 0, 1, 1, 0),
+      (0, 0, 0, 0, 1),
+      (1, 0, 0, 1, 0))
+
+    val b = dense(
+      (0, 1, 1, 0),
+      (1, 1, 1, 0),
+      (0, 0, 1, 0),
+      (1, 1, 0, 1))
+
+    val users = Seq("u1", "u2", "u3", "u4")
+    val itemsA = Seq("a1", "a2", "a3", "a4", "a5")
+    val itemsB = Seq("b1", "b2", "b3", "b4")
+    val userDict = new BiDictionary(users)
+    val itemsADict = new BiDictionary(itemsA)
+    val itemsBDict = new BiDictionary(itemsB)
+
+    // this is downsampled to the top 2 values per row to match the calc
+    val matrixLLRCoocAtBNonSymmetric = dense(
+      (0.0,                1.7260924347106847, 1.7260924347106847, 0.0),
+      (0.0,                0.6795961471815897, 0.6795961471815897, 0.0),
+      (1.7260924347106847, 0.6795961471815897, 0.0,                0.0),
+      (5.545177444479561,  1.7260924347106847, 0.0,                0.0),
+      (0.0,                0.0,                0.6795961471815897, 0.0))
+
+    val drmA = drmParallelize(m = a, numPartitions = 2)
+    val drmB = drmParallelize(m = b, numPartitions = 2)
+
+    val aID = new IndexedDatasetSpark(drmA, userDict, itemsADict)
+    val bID = new IndexedDatasetSpark(drmB, userDict, itemsBDict)
+    val aD = DownsamplableCrossOccurrenceDataset(aID)
+    val bD = DownsamplableCrossOccurrenceDataset(bID, maxInterestingElements = 2)
+
+    //self similarity
+    val drmCooc = SimilarityAnalysis.crossOccurrenceDownsampled(List(aD, bD))
+    val matrixSelfCooc = drmCooc(0).matrix.checkpoint().collect
+    val diffMatrix = matrixSelfCooc.minus(matrixLLRCoocAtAControl)
+    var n = (new MatrixOps(m = diffMatrix)).norm
+    n should be < 1E-10
+
+    //cross similarity
+    val matrixCrossCooc = drmCooc(1).matrix.checkpoint().collect
+    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBNonSymmetric)
+    n = (new MatrixOps(m = diff2Matrix)).norm
+    n should be < 1E-10
+  }
+
+  test("Cross-occurrence two IndexedDatasets LLR threshold"){
+    val a = dense(
+      (1, 1, 0, 0, 0),
+      (0, 0, 1, 1, 0),
+      (0, 0, 0, 0, 1),
+      (1, 0, 0, 1, 0))
+
+    val b = dense(
+      (0, 1, 1, 0),
+      (1, 1, 1, 0),
+      (0, 0, 1, 0),
+      (1, 1, 0, 1))
+
+    val users = Seq("u1", "u2", "u3", "u4")
+    val itemsA = Seq("a1", "a2", "a3", "a4", "a5")
+    val itemsB = Seq("b1", "b2", "b3", "b4")
+    val userDict = new BiDictionary(users)
+    val itemsADict = new BiDictionary(itemsA)
+    val itemsBDict = new BiDictionary(itemsB)
+
+    // this is downsampled to the top 2 values per row to match the calc but also uses a min llr threshold so
+    // the # per row is still applied but nothing gets past the min llr check
+    val matrixLLRCoocAtBNonSymmetric = dense(
+      (0.0,                1.7260924347106847, 1.7260924347106847, 0.0),
+      (0.0,                0.0,                0.0,                0.0),
+      (1.7260924347106847, 0.0,                0.0,                0.0),
+      (5.545177444479561,  1.7260924347106847, 0.0,                0.0),
+      (0.0,                0.0,                0.0,                0.0))
+
+    val drmA = drmParallelize(m = a, numPartitions = 2)
+    val drmB = drmParallelize(m = b, numPartitions = 2)
+
+    val aID = new IndexedDatasetSpark(drmA, userDict, itemsADict)
+    val bID = new IndexedDatasetSpark(drmB, userDict, itemsBDict)
+    val aD = DownsamplableCrossOccurrenceDataset(aID)
+    val bD = DownsamplableCrossOccurrenceDataset(bID, minLLROpt = Some(1.7), maxInterestingElements = 2)
+
+    //self similarity
+    val drmCooc = SimilarityAnalysis.crossOccurrenceDownsampled(List(aD, bD))
+    val matrixSelfCooc = drmCooc(0).matrix.checkpoint().collect
+    val diffMatrix = matrixSelfCooc.minus(matrixLLRCoocAtAControl)
+    var n = (new MatrixOps(m = diffMatrix)).norm
+    n should be < 1E-10
+
+    //cross similarity
+    val matrixCrossCooc = drmCooc(1).matrix.checkpoint().collect
+    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBNonSymmetric)
+    n = (new MatrixOps(m = diff2Matrix)).norm
+    n should be < 1E-10
+  }
+
   test("LLR calc") {
     val A = dense(
         (1, 1, 0, 0, 0),
