Repository: mahout
Updated Branches:
  refs/heads/master 3351b75b3 -> b5fe4aab2
MAHOUT-1853: Add new thresholds and partitioning methods to SimilarityAnalysis

Project: http://git-wip-us.apache.org/repos/asf/mahout/repo
Commit: http://git-wip-us.apache.org/repos/asf/mahout/commit/b5fe4aab
Tree: http://git-wip-us.apache.org/repos/asf/mahout/tree/b5fe4aab
Diff: http://git-wip-us.apache.org/repos/asf/mahout/diff/b5fe4aab

Branch: refs/heads/master
Commit: b5fe4aab22e7867ae057a6cdb1610cfa17555311
Parents: 3351b75
Author: pferrel <[email protected]>
Authored: Tue Sep 13 13:02:14 2016 -0700
Committer: pferrel <[email protected]>
Committed: Tue Sep 13 13:02:14 2016 -0700

----------------------------------------------------------------------
 CHANGELOG                                       | 627 -------------------
 .../mahout/math/cf/SimilarityAnalysis.scala     | 192 +++++-
 .../mahout/cf/SimilarityAnalysisSuite.scala     | 125 +++-
 3 files changed, 272 insertions(+), 672 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/mahout/blob/b5fe4aab/CHANGELOG
----------------------------------------------------------------------
diff --git a/CHANGELOG b/CHANGELOG
deleted file mode 100644
index 5cd8af5..0000000
--- a/CHANGELOG
+++ /dev/null
@@ -1,627 +0,0 @@
-Mahout Change Log
-
-Release 0.12.0 - unreleased
-
-  MAHOUT-1775: FileNotFoundException caused by aborting the process of downloading Wikipedia dataset (Bowei Zhang via smarthi)
-
-  MAHOUT-1771: Cluster dumper omits indices and 0 elements for dense vector or sparse containing 0s (srowen)
-
-  MAHOUT-1613: classifier.df.tools.Describe does not handle -D parameters (haohui mai via smarthi)
-
-  MAHOUT-1642: Iterator class within SimilarItems class always misses the first element (Oleg Zotov via smarthi)
-
-  MAHOUT-1675: Remove MLP from codebase (ZJaffe via smarthi)
-
-Release 0.11.0 - 2015-08-07
-
-  MAHOUT-1744: Deprecate lucene2seq (apalumbo)
-
-  MAHOUT-1761: Upgraded to Apache parent pom v17 (sslavic)
-
-  MAHOUT-1745: Purge deprecated ConcatVectorsJob from codebase (apalumbo)
-
-  MAHOUT-1757: small fix in spca formula (smarthi)
-
-  MAHOUT-1756: Missing +=: and *=: operators on vectors (smarthi)
-
-  NOJIRA: Clean up CLI help for spark-rowsimilarity and fixed test that intermitently failed (pferrel)
-
-  MAHOUT-1685: Move Mahout shell to Spark 1.3+ (dlyubimov, apalumbo)
-
-  MAHOUT-1653: Spark 1.3 (pferrel, apalumbo)
-
-  MAHOUT-1754: Distance and squared distance matrices routines (dlyubimov)
-
-  MAHOUT-1753: First and second moment routines (dlyubimov)
-
-  MAHOUT-1746: mxA ^ 2, mxA ^ 0.5 to mean the same thing as mxA * mxA and mxA ::= sqrt _ (dlyubimov)
-
-  MAHOUT-1736: Implement allreduceBlock() on H2O (avati)
-
-  MAHOUT-1752: Implement CbindScalar operator on H2O (avati)
-
-  MAHOUT-1660: Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf (dlyubimov)
-
-  MAHOUT-1713: Performance and parallelization improvements for AB', A'B, A'A spark physical operators (dlyubimov)
-
-  MAHOUT-1714: Add MAHOUT_OPTS environment when running Spark shell (dlyubimov)
-
-  MAHOUT-1715: Closeable API for broadcast tensors (dlyubimov)
-
-  MAHOUT-1716: Scala logging style (dlyubimov)
-
-  MAHOUT-1717: allreduceBlock() operator api and Spark implementation (dlyubimov)
-
-  MAHOUT-1718: Support for conversion of any type-keyed DRM into ordinally-keyed DRM (dlyubimov)
-
-  MAHOUT-1719: Unary elementwise function operator and function fusions (dlyubimov)
-
-  MAHOUT-1720: Support 1 cbind X, X cbind 1 etc. for both Matrix and DRM (dlyubimov)
-
-  MAHOUT-1721: rowSumsMap() summary for non-int-keyed DRMs (dlyubimov)
-
-  MAHOUT-1722: DRM row sampling api (dlyubimov)
-
-  MAHOUT-1723: Optional structural "flavor" abstraction for in-core matrices (dlyubimov)
-
-  MAHOUT-1724: Optimizations of matrix-matrix in-core multiplication based on structural flavors (dlyubimov)
-
-  MAHOUT-1725: elementwise power operator ^ (dlyubimov)
-
-  MAHOUT-1726: R-like vector concatenation operator (dlyubimov)
-
-  MAHOUT-1727: Elementwise analogues of scala.math functions for tensor types (dlyubimov)
-
-  MAHOUT-1728: In-core functional assignments (dlyubimov)
-
-  MAHOUT-1729: Straighten out behavior of Matrix.iterator() and iterateNonEmpty() (dlyubimov)
-
-  MAHOUT-1730: New mutable transposition view for in-core matrices (dlyubimov)
-
-  MAHOUT-1731: Deprecate SparseColumnMatrix (dlyubimov)
-
-  MAHOUT-1732: Native support for kryo serialization of tensor types (dlyubimov)
-
-Release 0.10.1 - 2015-05-31
-
-  MAHOUT-1704: Pare down dependency jar for h2o (apalumbo)
-
-  MAHOUT-1697: Fixed paths to which math-scala and spark modules docs get packaged under in bin distribution archive (sslavic)
-
-  MAHOUT-1696: QRDecomposition.solve(...) can return incorrect Matrix types (apalumbo)
-
-  MAHOUT-1690: CLONE - Some vector dumper flags are expecting arguments. (smarthi)
-
-  MAHOUT-1693: FunctionalMatrixView materializes row vectors in scala shell (apalumbo)
-
-  MAHOUT-1680: Renamed mahout-distribution to apache-mahout-distribution (sslavic)
-
-Release 0.10.0 - 2015-04-11
-
-  MAHOUT-1630: Incorrect SparseColumnMatrix.numSlices() causes IndexException in toString() (Oleg Nitz, smarthi)
-
-  MAHOUT-1665: Update hadoop commands in example scripts (akm)
-
-  MAHOUT-1676: Deprecate MLP, ConcatenateVectorsJob and ConcatenateVectorsReducer in the codebase (apalumbo)
-
-  MAHOUT-1622: MultithreadedBatchItemSimilarities outputs incorrect number of similarities (Jesse Daniels, Anand Avati via smarthi)
-
-  MAHOUT-1605: Make VisualizerTest locale independent (Frank Rosner, Anand Avati via smarthi)
-
-  MAHOUT-1635: Getting an exception when I provide classification labels manually for Naive Bayes (apalumbo)
-
-  MAHOUT-1662: Potential Path bug in SequenceFileVaultIterator breaks DisplaySpectralKMeans (Shannon Quinn)
-
-  MAHOUT-1656: Change SNAPSHOT version from 1.0 to 0.10.0 (smarthi)
-
-  MAHOUT-1593: cluster-reuters.sh does not work complaining java.lang.IllegalStateException (smarthi via akm)
-
-  MAHOUT-1661: All Lanczos modules marked as @Deprecated and slated for removal in future releases (Shannon Quinn)
-
-  MAHOUT-1638: H2O bindings fail at drmParallelizeWithRowLabels(...) (Anand Avati via apalumbo)
-
-  MAHOUT-1667: Hadoop 1 and 2 profile in POM (sslavic)
-
-  MAHOUT-1564: Naive Bayes Classifier for New Text Documents (apalumbo)
-
-  MAHOUT-1524: Script to auto-generate and view the Mahout website on a local machine (Saleem Ansari via apalumbo)
-
-  MAHOUT-1589: Deprecate mahout.cmd due to lack of support
-
-  MAHOUT-1655: Refactors mr-legacy into mahout-hdfs and mahout-mr, Spark now depends on much reduced mahout-hdfs
-
-  MAHOUT-1522: Handle logging levels via log4j.xml (akm)
-
-  MAHOUT-1602: Euclidean Distance Similarity Math (Leonardo Fernandez Sanchez, smarthi)
-
-  MAHOUT-1619: HighDFWordsPruner overwrites cache files (Burke Webster, smarthi)
-
-  MAHOUT-1516: classify-20newsgroups.sh failed: /tmp/mahout-work-jpan/20news-all does not exists in hdfs. (Jian Pan via apalumbo)
-
-  MAHOUT-1559: Add documentation for and clean up the wikipedia classifier example (apalumbo)
-
-  MAHOUT-1598: extend seq2sparse to handle multiple text blocks of same document (Wolfgang Buchnere via akm)
-
-  MAHOUT-1659: Remove deprecated Lanczos solver from spectral clustering in mr-legacy (Shannon Quinn)
-
-  MAHOUT-1612: NullPointerException happens during JSON output format for clusterdumper (smarthi, Manoj Awasthi)
-
-  MAHOUT-1652: Java 7 update (smarthi)
-
-  MAHOUT-1639: Streaming kmeans doesn't properly validate estimatedNumMapClusters -km (smarthi)
-
-  MAHOUT-1493: Port Naive Bayes to Scala DSL (apalumbo)
-
-  MAHOUT-1611: Preconditions.checkArgument in org.apache.mahout.utils.ConcatenateVectorsJob (Haishou Ma via smarthi)
-
-  MAHOUT-1615: SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles (Anand Avati, dlyubimov, apalumbo)
-
-  MAHOUT-1610: Update tests to pass in Java 8 (srowen)
-
-  MAHOUT-1608: Add option in WikipediaToSequenceFile to remove category labels from documents (apalumbo)
-
-  MAHOUT-1604: Spark version of rowsimilarity driver and associated additions to SimilarityAnalysis.scala (pferrel)
-
-  MAHOUT-1500: H2O Integration (Anand Avati via apalumbo)
-
-  MAHOUT-1606 - Add rowSums, rowMeans and diagonal extraction operations to distributed matrices (dlyubimov)
-
-  MAHOUT-1603: Tweaks for Spark 1.0.x (dlyubimov & pferrel)
-
-  MAHOUT-1596: implement rbind() operator (Anand Avati and dlyubimov)
-
-  MAHOUT-1597: A + 1.0 (element-wise scala operation) gives wrong result if rdd is missing rows, Spark side (dlyubimov)
-
-  MAHOUT-1595: MatrixVectorView - implement a proper iterateNonZero() (Anand Avati via dlyubimov)
-
-  MAHOUT-1590 Mahout unit test failures due to guava version conflict on hadoop 2 (Venkat Ranganathan via sslavic)
-
-  MAHOUT-1529(e): Move dense/sparse matrix test in mapBlock into spark (Anand Avati via dlyubimov)
-
-  MAHOUT-1583: cbind() operator for Scala DRMs (dlyubimov)
-
-  MAHOUT-1563: Eliminated warnings about multiple scala versions (sslavic)
-
-  MAHOUT-1541, MAHOUT-1568, MAHOUT-1569: Created text-delimited file I/O traits and classes on spark, a MahoutDriver for a CLI and a ItemSimilairtyDriver using the CLI
-
-  MAHOUT-1573: More explicit parallelism adjustments in math-scala DRM apis; elements of automatic parallelism management (dlyubimov)
-
-  MAHOUT-1580: Optimize getNumNonZeroElements() (ssc)
-
-  MAHOUT-1464: Cooccurrence Analysis on Spark (pat)
-
-  MAHOUT-1578: Optimizations in matrix serialization (ssc)
-
-  MAHOUT-1572: blockify() to detect (naively) the data sparsity in the loaded data (dlyubimov)
-
-  MAHOUT-1571: Functional Views are not serialized as dense/sparse correctly (dlyubimov)
-
-  MAHOUT-1566: (Experimental) Regular ALS factorizer with conversion tests, optimizer enhancements and bug fixes (dlyubimov)
-
-  MAHOUT-1537: Minor fixes to spark-shell (Anand Avati via dlyubimov)
-
-  MAHOUT-1529: Finalize abstraction of distributed logical plans from backend operations (dlyubimov)
-
-  MAHOUT-1489: Interactive Scala & Spark Bindings Shell & Script processor (dlyubimov)
-
-  MAHOUT-1346: Spark Bindings (DRM) (dlyubimov)
-
-  MAHOUT-1555: Exception thrown when a test example has the label not present in training examples (Karol Grzegorczyk via smarthi)
-
-  MAHOUT-1446: Create an intro for matrix factorization (Jian Wang via ssc)
-
-  MAHOUT-1480: Clean up website on 20 newsgroups (Andrew Palumbo via ssc)
-
-  MAHOUT-1561: cluster-syntheticcontrol.sh not running locally with MAHOUT_LOCAL=true (Andrew Palumbo via ssc)
-
-  MAHOUT-1558: Clean up classify-wiki.sh and add in a binary classification problem (Andrew Palumbo via ssc)
-
-  MAHOUT-1560: Last batch is not filled correctly in MultithreadedBatchItemSimilarities (Jarosław Bojar)
-
-  MAHOUT-1554: Provide more comprehensive classification statistics (Karol Grzegorczyk via ssc)
-
-  MAHOUT-1548: Fix broken links in quickstart webpage (Andrew Palumbo via ssc)
-
-  MAHOUT-1542: Tutorial for playing with Mahout's Spark shell (ssc)
-
-  MAHOUT-1533: Remove Frequent Pattern Mining (ssc)
-
-  MAHOUT-1532: Add solve() function to the Scala DSL (ssc)
-
-  MAHOUT-1530: Custom prompt and welcome message for the Spark Shell (ssc)
-
-  MAHOUT-1527: Fix wikipedia classifier example (Andrew Palumbo via ssc)
-
-  MAHOUT-1526: Ant file in examples (ssc)
-
-  MAHOUT-1523: Remove @author tags in sparkbindings (ssc)
-
-  MAHOUT-1521: lucene2seq - Error trying to load data from stored field (when non-indexed) (Terry Blankers via frankscholten)
-
-  MAHOUT-1520: Fix links in Mahout website documentation (Saleem Ansari via smarthi)
-
-  MAHOUT-1519: Remove StandardThetaTrainer (Andrew Palumbo via ssc)
-
-  MAHOUT-1517: Remove casts to int in ALSWRFactorizer (ssc)
-
-  MAHOUT-1513: Deprecate Canopy Clustering (ssc)
-
-  MAHOUT-1511: Renaming core to mrlegacy (frankscholten)
-
-  MAHOUT-1510: Goodbye MapReduce (ssc)
-
-  MAHOUT-1509: Invalid URL in link from "quick start/basics" page (Nick Martin, smarthi)
-
-  MAHOUT-1508: Performance problems with sparse matrices (ssc)
-
-  MAHOUT-1505: structure of clusterdump's JSON output (akm)
-
-  MAHOUT-1504: Enable/fix thetaSummer job in TrainNaiveBayesJob (Andrew Palumbo, smarthi)
-
-  MAHOUT-1503: TestNaiveBayesDriver fails in sequential mode (Andrew Palumbo, smarthi)
-
-  MAHOUT-1502: Update Naive Bayes Webpage to Current Implementation (Andrew Palumbo via ssc)
-
-  MAHOUT-1501: ClusterOutputPostProcessorDriver has private default constructor (ssc)
-
-  MAHOUT-1498: DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie (Sergey via ssc)
-
-  MAHOUT-1497: mahout resplit not producing splited files (ssc)
-
-  MAHOUT-1496: Create a website describing the distributed ALS recommender (Jian Wang via ssc)
-
-  MAHOUT-1491: Spectral KMeans Clustering doesn't clean its /tmp dir and fails when seeing it again (smarthi)
-
-  MAHOUT-1488: DisplaySpectralKMeans fails: examples/output/clusteredPoints/part-m-00000 does not exist (Saleem Ansari via smarthi)
-
-  MAHOUT-1483: Organize links in web site navigation bar (akm)
-
-  MAHOUT-1482: Rework quickstart website (Jian Wang via ssc)
-
-  MAHOUT-1476: Cleanup website on Hidden Markov Models (akm)
-
-  MAHOUT-1475: Cleanup website on Naive Bayes (smarthi)
-
-  MAHOUT-1472: Cleanup website on fuzzy kmeans (smarthi)
-
-  MAHOUT-1471: Cleanup website for Canopy clustering (smarthi)
-
-  MAHOUT-1468: Creating a new page for StreamingKMeans documentation on mahout website (Maxim Arap and Pavan Kumar via akm)
-
-  MAHOUT-1467: ClusterClassifier readPolicy leaks file handles (Avi Shinnar, smarthi)
-
-  MAHOUT-1466: Cluster visualization fails to execute (ssc)
-
-  MAHOUT-1465: Clean up README (akm)
-
-  MAHOUT-1463: Modify OnlineSummarizers to use the TDigest dependency from Maven Central (tdunning, smarthi)
-
-  MAHOUT-1460: Remove reference to Dirichlet in ClusterIterator (frankscholten)
-
-  MAHOUT-1459: Move Hadoop related code out of CanopyClusterer (frankscholten)
-
-  MAHOUT-1458: Remove KMeansConfigKeys and FuzzyKMeansConfigKeys (frankscholten)
-
-  MAHOUT-1457: Move EigenSeedGenerator into spectral kmeans package (frankscholten)
-
-  MAHOUT-1455: Forkcount config causes JVM crashes during build (frankscholten)
-
-  MAHOUT-1451: Cleaning up the examples for clustering on the website (Gaurav Misra via ssc)
-
-  MAHOUT-1450: Cleaning up clustering documentation on mahout website (Pavan Kumar)
-
-  MAHOUT-1449: Update the Known Issues in Random Forests Page (Manoj Awasthi via ssc)
-
-  MAHOUT-1448: In Random Forest, the training does not support multiple input files. The input dataset must be one single file. (Manoj Awasthi via ssc)
-
-  MAHOUT-1447: ImplicitFeedbackAlternatingLeastSquaresSolver tests and features (Adam Ilardi via ssc)
-
-  MAHOUT-1445: Create an intro for item based recommender (Nick Martin via ssc)
-
-  MAHOUT-1440: Add option to set the RNG seed for inital cluster generation in Kmeans/fKmeans (Andrew Palumbo via ssc)
-
-  MAHOUT-1438: "quickstart" tutorial for building a simple recommender (Maciej Mazur and Steve Cook via ssc)
-
-  MAHOUT-1434: Dead links on the web ste (Kevin Moulart, smarthi)
-
-  MAHOUT-1433: Make SVDRecommender look at all unknown items of a user per default (ssc)
-
-  MAHOUT-1429: Parallelize YtransposeY in ImplicitFeedbackAlternatingLeastSquaresSolver (Adam Ilardi via ssc)
-
-  MAHOUT-1428: Recommending already consumed items (Dodi Hakim via ssc)
-
-  MAHOUT-1425: SGD classifier example with bank marketing dataset. (frankscholten)
-
-  MAHOUT-1420: Add solr-recommender to examples (Pat Ferrel via akm)
-
-  MAHOUT-1419: Random decision forest is excessively slow on numeric features (srowen)
-
-  MAHOUT-1417: Random decision forest implementation fails in Hadoop 2 (srowen)
-
-  MAHOUT-1416: Make access of DecisionForest.read(dataInput) less restricted (Manoj Awasthi via smarthi)
-
-  MAHOUT-1415: Clone method on sparse matrices fails if there is an empty row which has not been set explicitly (till.rohrmann via ssc)
-
-  MAHOUT-1413: Rework Algorithms page (ssc)
-
-  MAHOUT-1388: Add command line support and logging for MLP (Yexi Jiang via ssc)
-
-  MAHOUT-1385: Caching Encoders don't cache (Johannes Schulte, Manoj Awasthi via ssc)
-
-  MAHOUT-1356: Ensure unit tests fail fast when writing outside mvn target directory (isabel, smarthi, dweiss, frankscholten, akm)
-
-  MAHOUT-1329: Mahout for hadoop 2 (gcapan, Sergey Svinarchuk)
-
-  MAHOUT-1310: Mahout support windows (Sergey Svinarchuk via ssc)
-
-  MAHOUT-1278: Upgraded to apache parent pom version 16 (sslavic)
-
-Release 0.9 - 2014-02-01
-
-  MAHOUT-1387: Create page for release notes (ssc)
-
-  MAHOUT-1411: Random test failures from TDigestTest (smarthi)
-
-  MAHOUT-1410: clusteredPoints do not contain a vector id (smarthi, Andrew Musselman)
-
-  MAHOUT-1409: MatrixVectorView has index check error (tdunning)
-
-  MAHOUT-1402: Zero clusters using streaming k-means option in cluster-reuters.sh (smarthi)
-
-  MAHOUT-1401: Resurrect Frequent Pattern mining (smarthi)
-
-  MAHOUT-1400: Remove references to deprecated and removed algorithms from examples scripts (ssc)
-
-  MAHOUT-1399: Fixed multiple slf4j bindings when running Mahout examples issue (sslavic)
-
-  MAHOUT-1398: FileDataModel should provide a constructor with a delimiterPattern (Roy Guo via ssc)
-
-  MAHOUT-1396: Accidental use of commons-math won't work with next Hadoop 2 release (srowen)
-
-  MAHOUT-1394: Undeprecate Lanczos (ssc)
-
-  MAHOUT-1393: Remove duplicated code from getTopTerms and getTopFeatures in AbstractClusterWriter (Diego Carrion via smarthi)
-
-  MAHOUT-1392: Streaming KMeans should write centroid output to a 'part-r-xxxx' file when executed in sequential mode (smarthi)
-
-  MAHOUT-1390: SVD hangs for certain inputs (tdunning)
-
-  MAHOUT-1389: Complementary Naive Bayes Classifier not getting called when "-c" option is activated (Gouri Shankar Majumdar via smarthi)
-
-  MAHOUT-1384: Executing the MR version of Naive Bayes/CNB of classify_20newgroups.sh fails in seqdirectory step (smarthi)
-
-  MAHOUT-1382: Upgrade Mahout third party jars for 0.9 Release (smarthi)
-
-  MAHOUT-1380: Streaming KMeans fails when executed in Sequential Mode (smarthi)
-
-  MAHOUT-1379: ClusterQualitySummarizer fails with the new T-Digest for clusters with 1 data point (smarthi)
-
-  MAHOUT-1378: Running Random Forest with Ignored features fails when loading feature descriptor from JSON file (Sam Wu via smarthi)
-
-  MAHOUT-1377: Exclude JUnit.jar from tarball (Sergey Svinarchuk via smarthi)
-
-  MAHOUT-1374: Ability to provide input file with userid, itemid pair (Aliaksei Litouka via ssc)
-
-  MAHOUT-1371: Arff loader can misinterpret nominals with integer, real or string (Mansur Iqbal via smarthi)
-
-  MAHOUT-1370: Vectordump doesn't write to output file in MapReduce Mode (smarthi)
-
-  MAHOUT-1368: Convert OnlineSummarizer to use the new TDigest (tdunning)
-
-  MAHOUT-1367: WikipediaXmlSplitter --> Exception in thread "main" java.lang.NullPointerException (smarthi)
-
-  MAHOUT-1364: Upgrade Mahout codebase to Lucene 4.6 (Frank Scholten)
-
-  MAHOUT-1363: Rebase packages in mahout-scala (dlyubimov)
-
-  MAHOUT-1362: Remove examples/bin/build-reuters.sh (smarthi)
-
-  MAHOUT-1361: Online algorithm for computing accurate Quantiles using 1-D clustering (tdunning)
-
-  MAHOUT-1358: StreamingKMeansThread throws IllegalArgumentException when REDUCE_STREAMING_KMEANS is set to true (smarthi)
-
-  MAHOUT-1355: InteractionValueEncoder produces wrong traceDictionary entries (Johannes Schulte via smarthi)
-
-  MAHOUT-1353: Visibility of preparePreferenceMatrix directory location (Pat Ferrel, ssc)
-
-  MAHOUT-1352: Option to change RecommenderJob output format (Pat Ferrel, ssc)
-
-  MAHOUT-1351: Adding DenseVector support to AbstractCluster (David DeBarr via smarthi)
-
-  MAHOUT-1349: Clusterdumper/loadTermDictionary crashes when highest index in (sparse) dictionary vector is larger than dictionary vector size (Andrew Musselman via smarthi)
-
-  MAHOUT-1347: Add Streaming K-Means clustering algorithm to examples/bin/cluster-reuters.sh (smarthi)
-
-  MAHOUT-1345: Enable randomised testing for all Mahout modules (Dawid Weiss, Isabel, sslavic, Frank Scholten, smarthi)
-
-  MAHOUT-1343: JSON output format support in cluster dumper (Telvis Calhoun via sslavic)
-
-  MAHOUT-1333: Fixed examples bin directory permissions in distribution archives (Mike Percy via sslavic)
-
-  MAHOUT-1319: seqdirectory -filter argument silently ignored when run as MR (smarthi)
-
-  MAHOUT-1317: Clarify some of the messages in Preconditions.checkArgument (Nikolai Grinko, smarthi)
-
-  MAHOUT-1314: StreamingKMeansReducer throws NullPointerException when REDUCE_STREAMING_KMEANS is set to true (smarthi)
-
-  MAHOUT-1313: Fixed unwanted integral division bug in RowSimilarityJob downsampling code where precision should have been retained (sslavic)
-
-  MAHOUT-1312: LocalitySensitiveHashSearch does not limit search results (sslavic)
-
-  MAHOUT-1308: Cannot extend CandidateItemsStrategy due to restricted visibility (David Geiger, smarthi)
-
-  MAHOUT-1301: toString() method of SequentialAccessSparseVector has excess comma at the end (Alexander Senov, smarthi)
-
-  MAHOUT-1297: New module for linear algebra scala DSL (dlyubimov)
-
-  MAHOUT-1296: Remove deprecated algorithms (ssc)
-
-  MAHOUT-1295: Excluded all Maven's target directories from distribution archives (sslavic)
-
-  MAHOUT-1294: Cleanup previously installed artifacts from CI server local repository (sslavic)
-
-  MAHOUT-1293: Source distribution tar.gz archive cannot be unpacked on Linux (sslavic)
-
-  MAHOUT-1292: lucene2seq should validate the 'id' field (Frank Scholten via smarthi)
-
-  MAHOUT-1291: MahoutDriver yields cosmetically suboptimal exception when bin/mahout runs without args, on some Hadoop versions (srowen)
-
-  MAHOUT-1290: Issue when running Mahout Recommender Demo (Helder Garay Martins via smarthi)
-
-  MAHOUT-1289: Move downsampling code into RowSimilarityJob (ssc)
-
-  MAHOUT-1287: classifier.sgd.CsvRecordFactory incorrectly parses CSV format (Alex Franchuk via smarthi)
-
-  MAHOUT-1285: Arff loader can misparse string data as double (smarthi)
-
-  MAHOUT-1284: DummyRecordWriter's bug with reused Writables (Maysam Yabandeh via smarthi)
-
-  MAHOUT-1275: Dropped bz2 distribution format for source and binaries (sslavic)
-
-  MAHOUT-1265: Multilayer Perceptron (Yexi Jiang via smarthi)
-
-  MAHOUT-1261: TasteHadoopUtils.idToIndex can return an int that has size Integer.MAX_VALUE (Carl Clark, smarthi)
-
-  MAHOUT-1242: No key redistribution function for associative maps (Tharindu Rusira via smarthi)
-
-  MAHOUT-1030: Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable (Andrew Musselman, Pat Ferrel, Jeff Eastman, Lars Norskog, smarthi)
-
-Release 0.8 - 2013-07-25
-
-  MAHOUT-1272: Parallel SGD matrix factorizer for SVDrecommender (Peng Cheng via ssc)
-
-  MAHOUT-1271: classify-20newsgroups.sh fails during the seqdirectory step (smarthi)
-
-  MAHOUT-1269: Cleanup deprecated Lucene 3.x API calls in lucene2seq utility unit tests (smarthi)
-
-  MAHOUT-833: Make conversion to sequence files map-reduce (Josh Patterson, smarthi)
-
-  MAHOUT-1268: Wrong output directory for CVB (Mark Wicks via ssc)
-
-  MAHOUT-1264: Performance optimizations in RecommenderJob (ssc)
-
-  MAHOUT-1262: Cleanup LDA code (ssc)
-
-  MAHOUT-1255: Fix for weights in Multinomial sometimes overflowing in BallKMeans (dfilimon)
-
-  MAHOUT-1254: Final round of cleanup for StreamingKMeans (dfilimon)
-
-  MAHOUT-1263: Serialise/Deserialise Lambda value for OnlineLogisticRegression (Mike Davy via smarthi)
-
-  MAHOUT-1258: Another shot at findbugs and checkstyle (ssc)
-
-  MAHOUT-1253: Add experiment tools for StreamingKMeans, part 1 (dfilimon)
-
-  MAHOUT-884: Matrix Concatenate Utility (Lance Norskog via smarthi)
-
-  MAHOUT-1250: Deprecate unused algorithms (ssc)
-
-  MAHOUT-1251: Optimize MinHashMapper (ssc)
-
-  MAHOUT-1211: Disabled swallowing of IOExceptions is Closeables.close for writers (dfilimon)
-
-  MAHOUT-1164: Make ARFF integration generate meta-data in JSON format (Marty Kube via ssc)
-
-  MAHOUT-1164: Make ARFF integration generate meta-data in JSON format (Marty Kube via ssc)
-
-  MAHOUT-1163: Make random forest classifier meta-data file human readable (Marty Kube via ssc)
-
-  MAHOUT-1243: Dictionary file format in Lucene-Mahout integration is not in SequenceFileFormat (ssc)
-
-  MAHOUT-974: org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId (ssc)
-
-  MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values) (Elena Smirnova via smarthi)
-
-  MAHOUT-1237: Total cluster cost isn't computed properly (dfilimon)
-
-  MAHOUT-1196: LogisticModelParameters uses csv.getTargetCategories() even if csv is not used. (Vineet Krishnan via ssc)
-
-  MAHOUT-1224: Add the option of running a StreamingKMeans pass in the Reducer before BallKMeans (dfilimon)
-
-  MAHOUT-993: Some vector dumper flags are expecting arguments. (Andrew Look via robinanil)
-
-  MAHOUT-1228: Cleanup .gitignore (Stevo Slavic via ssc)
-
-  MAHOUT-1047: CVB hangs after completion (Angel Martinez Gonzalez via smarthi)
-
-  MAHOUT-1235: ParallelALSFactorizationJob does not use VectorSumCombiner (ssc)
-
-  MAHOUT-1230: SparceMatrix.clone() is not deep copy (Maysam Yabandeh via tdunning)
-
-  MAHOUT-1232: VectorHelper.topEntries() throws a NPE when number of NonZero elements in vector < maxEntries (smarthi)
-
-  MAHOUT-1229: Conf directory content from Mahout distribution archives cannot be unpacked (Stevo Slavic via smarthi)
-
-  MAHOUT-1213: SSVD job doesn't clean it's temp dir, and fails when seeing it again (smarthi)
-
-  MAHOUT-1223: Fixed point skipped in StreamingKMeans when iterating through centroids from a reducer (dfilimon)
-
-  MAHOUT-1222: Fix total weight in FastProjectionSearch (dfilimon)
-
-  MAHOUT-1219: Remove LSHSearcher from StreamingKMeansTest. It causes it to sometimes fail (dfilimon)
-
-  MAHOUT-1221: SparseMatrix.viewRow is sometimes readonly. (Maysam Yabandeh via smarthi)
-
-  MAHOUT-1219: Remove LSHSearcher from SearchQualityTest. It causes it to fail, but the failure is not very meaningful (dfilimon)
-
-  MAHOUT-1217: Nearest neighbor searchers sometimes fail to remove points: fix in FastProjectionSearch's searchFirst (dfilimon)
-
-  MAHOUT-1216: Add locality sensitive hashing and a LocalitySensitiveHash searcher (dfilimon)
-
-  MAHOUT-1181: Adding StreamingKMeans MapReduce classes (dfilimon)
-
-  MAHOUT-1212: Incorrect classify-20newsgroups.sh file description (Julian Ortega via smarthi)
-
-  MAHOUT-1209: DRY out maven-compiler-plugin configuration (Stevo Slavic via smarthi)
-
-  MAHOUT-1207: Fix typos in description in parent pom (Stevo Slavic via smarthi)
-
-  MAHOUT-1199: Improve javadoc comments of mahout-integration (Angel Martinez Gonzalez via smarthi)
-
-  MAHOUT-1162: Adding BallKMeans and StreamingKMeans clustering algorithms (dfilimon)
-
-  MAHOUT-1205: ParallelALSFactorizationJob should leverage the distributed cache (ssc)
-
-  MAHOUT-1156: Adding nearest neighbor Searchers (dfilimon)
-
-  MAHOUT-1202: Speed up Vector operations (dfilimon)
-
-  MAHOUT-1155: Make MatrixSlice a Vector (and fix Centroid cloning; MAHOUT-1202) (dfilimon)
-
-  MAHOUT-1189: CosineDistanceMeasure doesn't return 0 for two 0 vectors (dfilimon)
-
-  MAHOUT-1180: Multinomial<T> throws ConcurrentModificationException when iterating and setting probabilities (dfilimon)
-
-  MAHOUT-1192: Speed up Vector Operations (robinanil)
-
-  MAHOUT-1191: Cleanup Vector Benchmarks make it less variable (robinanil)
-
-  MAHOUT-1190: SequentialAccessSparseVector function assignment is very slow and other iterator woes (robinanil)
-
-  MAHOUT-1188: Inconsistent reference to Lucene versions in code and POM (smarthi)
-
-  MAHOUT-1161: Unable to run CJKAnalyzer for conversion of a sequence file to sparse vector due to instantiation exception (ssc)
-
-  MAHOUT-1187: Update Commons Lang to Commons Lang3 (smarthi)
-
-  MAHOUT-1184 Another take at pmd, findbugs and checkstyle (ssc)
-
-  MAHOUT-1182: Remove useless append (Dave Brosius via tdunning)
-
-  MAHOUT-1176: Introduce a changelog file to raise contributors attribution (ssc)
-
-  MAHOUT-1108: Allows cluster-reuters.sh example to be executed on a cluster (elmer.garduno via gsingers)
-
-  MAHOUT-961: Fix issue in decision forest tree visualizer to properly show stems of tree (Ikumasa Mukai via gsingers)
-
-  MAHOUT-944: Create SequenceFiles out of Lucene document storage (no term vectors required) (Frank Scholten, gsingers)
-
-  MAHOUT-958: Fix issue with globs in RepresentativePointsDriver (Adam Baron, Vikram Dixit K, ehgjr via gsingers)
-
-  MAHOUT-1084: Fixed issue with too many clusters in synthetic control example (liutengfei, gsingers)
-
-  MAHOUT-1103: Fixed issue with splitting clusters on Hadoop (Matt Molek, gsingers)
-
-  MAHOUT-1126: Filter out bad META-INF files in job packaging (Pat Ferrel, gsingers)
-
-  MAHOUT-1211: Change deprecated Closeables.closeQuietly calls (smarthi, gsingers, srowen, dlyubimov)

http://git-wip-us.apache.org/repos/asf/mahout/blob/b5fe4aab/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
----------------------------------------------------------------------
diff --git a/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala b/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
index 4632468..a10b942 100644
--- a/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
+++ b/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
@@ -44,23 +44,34 @@ object SimilarityAnalysis extends Serializable {
   /** Compares (Int,Double) pairs by the second value */
   private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case ((_, score1), (_, score2)) => score1 > score2}
 
+  lazy val defaultParOpts = ParOpts()
+
   /**
    * Calculates item (column-wise) similarity using the log-likelihood ratio on A'A, A'B, A'C, ...
* and returns a list of similarity and cross-similarity matrices - * @param drmARaw Primary interaction matrix + * + * @param drmARaw Primary interaction matrix * @param randomSeed when kept to a constant will make repeatable downsampling * @param maxInterestingItemsPerThing number of similar items to return per item, default: 50 * @param maxNumInteractions max number of interactions after downsampling, default: 500 + * @param parOpts partitioning params for drm.par(...) * @return a list of [[org.apache.mahout.math.drm.DrmLike]] containing downsampled DRMs for cooccurrence and * cross-cooccurrence */ - def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, maxInterestingItemsPerThing: Int = 50, - maxNumInteractions: Int = 500, drmBs: Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = { + def cooccurrences( + drmARaw: DrmLike[Int], + randomSeed: Int = 0xdeadbeef, + maxInterestingItemsPerThing: Int = 50, + maxNumInteractions: Int = 500, + drmBs: Array[DrmLike[Int]] = Array(), + parOpts: ParOpts = defaultParOpts) + : List[DrmLike[Int]] = { implicit val distributedContext = drmARaw.context - // backend allowed to optimize partitioning - drmARaw.par(auto = true) + // backend partitioning defaults to 'auto', which is often better decided by calling funciton + // todo: this should ideally be different per drm + drmARaw.par( min = parOpts.minPar, exact = parOpts.exactPar, auto = parOpts.autoPar) // Apply selective downsampling, pin resulting matrix val drmA = sampleDownAndBinarize(drmARaw, randomSeed, maxNumInteractions) @@ -82,8 +93,9 @@ object SimilarityAnalysis extends Serializable { // Now look at cross cooccurrences for (drmBRaw <- drmBs) { - // backend allowed to optimize partitioning - drmBRaw.par(auto = true) + // backend partitioning defaults to 'auto', which is often better decided by calling funciton + // todo: this should ideally be different per drm + drmARaw.par( min = parOpts.minPar, exact = parOpts.exactPar, auto = parOpts.autoPar) // 
Down-sample and pin other interaction matrix
       val drmB = sampleDownAndBinarize(drmBRaw, randomSeed, maxNumInteractions).checkpoint()
@@ -100,21 +112,11 @@ object SimilarityAnalysis extends Serializable {
       similarityMatrices = similarityMatrices :+ drmSimilarityAtB
 
       drmB.uncache()
-
-      //debug
-      val atbRows = drmSimilarityAtB.nrow
-      val atbCols = drmSimilarityAtB.ncol
-      val i = 0
     }
 
     // Unpin downsampled interaction matrix
     drmA.uncache()
 
-    //debug
-    val ataRows = drmSimilarityAtA.nrow
-    val ataCols = drmSimilarityAtA.ncol
-    val i = 0
-
     // Return list of similarity matrices
     similarityMatrices
   }
@@ -123,23 +125,27 @@ object SimilarityAnalysis extends Serializable {
    * Calculates item (column-wise) similarity using the log-likelihood ratio on A'A, A'B, A'C, ... and returns
    * a list of similarity and cross-similarity matrices. Somewhat easier to use method, which handles the ID
    * dictionaries correctly
+   *
    * @param indexedDatasets first in array is primary/A matrix all others are treated as secondary
    * @param randomSeed use default to make repeatable, otherwise pass in system time or some randomizing seed
    * @param maxInterestingItemsPerThing max similarities per items
    * @param maxNumInteractions max number of input items per item
+   * @param parOpts partitioning params for drm.par(...)
    * @return a list of [[org.apache.mahout.math.indexeddataset.IndexedDataset]] containing downsampled
    *         IndexedDatasets for cooccurrence and cross-cooccurrence
    */
-  def cooccurrencesIDSs(indexedDatasets: Array[IndexedDataset],
-      randomSeed: Int = 0xdeadbeef,
-      maxInterestingItemsPerThing: Int = 50,
-      maxNumInteractions: Int = 500):
+  def cooccurrencesIDSs(
+    indexedDatasets: Array[IndexedDataset],
+    randomSeed: Int = 0xdeadbeef,
+    maxInterestingItemsPerThing: Int = 50,
+    maxNumInteractions: Int = 500,
+    parOpts: ParOpts = defaultParOpts):
     List[IndexedDataset] = {
     val drms = indexedDatasets.map(_.matrix.asInstanceOf[DrmLike[Int]])
     val primaryDrm = drms(0)
     val secondaryDrms = drms.drop(1)
     val coocMatrices = cooccurrences(primaryDrm, randomSeed, maxInterestingItemsPerThing,
-      maxNumInteractions, secondaryDrms)
+      maxNumInteractions, secondaryDrms, parOpts)
     val retIDSs = coocMatrices.iterator.zipWithIndex.map {
       case( drm, i ) =>
         indexedDatasets(0).create(drm, indexedDatasets(0).columnIDs, indexedDatasets(i).columnIDs)
@@ -148,19 +154,110 @@ object SimilarityAnalysis extends Serializable {
   }
 
   /**
+   * Calculates item (column-wise) similarity using the log-likelihood ratio on A'A, A'B, A'C, ... and returns
+   * a list of similarity and cross-occurrence matrices. Somewhat easier to use method, which handles the ID
+   * dictionaries correctly and contains info about downsampling in each model calc.
+   *
+   * @param datasets first in array is primary/A matrix all others are treated as secondary, includes information
+   *                 used to downsample the input drm as well as the output llr(A'A), llr(A'B). The information
+   *                 is contained in each dataset in the array and applies to the model calculation of A' with
+   *                 the dataset. Todo: ignoring absolute threshold for now.
+   * @param randomSeed use default to make repeatable, otherwise pass in system time or some randomizing seed
+   * @param parOpts partitioning params for drm.par(...)
+   * @return a list of [[org.apache.mahout.math.indexeddataset.IndexedDataset]] containing downsampled
+   *         IndexedDatasets for cooccurrence and cross-cooccurrence
+   */
+  def crossOccurrenceDownsampled(
+    datasets: List[DownsamplableCrossOccurrenceDataset],
+    randomSeed: Int = 0xdeadbeef):
+    List[IndexedDataset] = {
+
+
+    val crossDatasets = datasets.drop(1) // drop A
+    val primaryDataset = datasets.head // use A throughout
+    val drmARaw = primaryDataset.iD.matrix
+
+    implicit val distributedContext = primaryDataset.iD.matrix.context
+
+    // backend partitioning defaults to 'auto', which is often better decided by calling funciton
+    val parOptsA = primaryDataset.parOpts.getOrElse(defaultParOpts)
+    drmARaw.par( min = parOptsA.minPar, exact = parOptsA.exactPar, auto = parOptsA.autoPar)
+
+    // Apply selective downsampling, pin resulting matrix
+    val drmA = sampleDownAndBinarize(drmARaw, randomSeed, primaryDataset.maxElementsPerRow)
+
+    // num users, which equals the maximum number of interactions per item
+    val numUsers = drmA.nrow.toInt
+
+    // Compute & broadcast the number of interactions per thing in A
+    val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerColumn)
+
+    // Compute cooccurrence matrix A'A
+    val drmAtA = drmA.t %*% drmA
+
+    // Compute loglikelihood scores and sparsify the resulting matrix to get the similarity matrix
+    val drmSimilarityAtA = computeSimilarities(drmAtA, numUsers, primaryDataset.maxInterestingElements,
+      bcastInteractionsPerItemA, bcastInteractionsPerItemA, crossCooccurrence = false,
+      minLLROpt = primaryDataset.minLLROpt)
+
+    var similarityMatrices = List(drmSimilarityAtA)
+
+    // Now look at cross cooccurrences
+    for (dataset <- crossDatasets) {
+      // backend partitioning defaults to 'auto', which is often better decided by calling funciton
+      val parOptsB = dataset.parOpts.getOrElse(defaultParOpts)
+      dataset.iD.matrix.par(min = parOptsB.minPar, exact = parOptsB.exactPar, auto = parOptsB.autoPar)
+
+      // Downsample and pin
other interaction matrix
+      val drmB = sampleDownAndBinarize(dataset.iD.matrix, randomSeed, dataset.maxElementsPerRow).checkpoint()
+
+      // Compute & broadcast the number of interactions per thing in B
+      val bcastInteractionsPerThingB = drmBroadcast(drmB.numNonZeroElementsPerColumn)
+
+      // Compute cross-cooccurrence matrix A'B
+      val drmAtB = drmA.t %*% drmB
+
+      val drmSimilarityAtB = computeSimilarities(drmAtB, numUsers, dataset.maxInterestingElements,
+        bcastInteractionsPerItemA, bcastInteractionsPerThingB, minLLROpt = dataset.minLLROpt)
+
+      similarityMatrices = similarityMatrices :+ drmSimilarityAtB
+
+      drmB.uncache()
+    }
+
+    // Unpin downsampled interaction matrix
+    drmA.uncache()
+
+    // Return list of datasets
+    val retIDSs = similarityMatrices.iterator.zipWithIndex.map {
+      case( drm, i ) =>
+        datasets(0).iD.create(drm, datasets(0).iD.columnIDs, datasets(i).iD.columnIDs)
+    }
+    retIDSs.toList
+
+  }
+
+  /**
    * Calculates row-wise similarity using the log-likelihood ratio on AA' and returns a DRM of rows and similar rows
+   *
    * @param drmARaw Primary interaction matrix
    * @param randomSeed when kept to a constant will make repeatable downsampling
    * @param maxInterestingSimilaritiesPerRow number of similar items to return per item, default: 50
    * @param maxNumInteractions max number of interactions after downsampling, default: 500
+   * @param parOpts partitioning options used for drm.par(...)
    */
-  def rowSimilarity(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, maxInterestingSimilaritiesPerRow: Int = 50,
-                    maxNumInteractions: Int = 500): DrmLike[Int] = {
+  def rowSimilarity(
+    drmARaw: DrmLike[Int],
+    randomSeed: Int = 0xdeadbeef,
+    maxInterestingSimilaritiesPerRow: Int = 50,
+    maxNumInteractions: Int = 500,
+    parOpts: ParOpts = defaultParOpts): DrmLike[Int] = {
 
     implicit val distributedContext = drmARaw.context
 
-    // backend allowed to optimize partitioning
-    drmARaw.par(auto = true)
+    // backend partitioning defaults to 'auto', which is often better decided by calling funciton
+    // todo: should this ideally be different per drm?
+    drmARaw.par(min = parOpts.minPar, exact = parOpts.exactPar, auto = parOpts.autoPar)
 
     // Apply selective downsampling, pin resulting matrix
     val drmA = sampleDownAndBinarize(drmARaw, randomSeed, maxNumInteractions)
@@ -184,6 +281,7 @@ object SimilarityAnalysis extends Serializable {
   /**
    * Calculates row-wise similarity using the log-likelihood ratio on AA' and returns a drm of rows and similar rows.
    * Uses IndexedDatasets, which handle external ID dictionaries properly
+   *
    * @param indexedDataset compare each row to every other
    * @param randomSeed use default to make repeatable, otherwise pass in system time or some randomizing seed
    * @param maxInterestingSimilaritiesPerRow max elements returned in each row
@@ -211,9 +309,17 @@ object SimilarityAnalysis extends Serializable {
 
   }
 
-  def computeSimilarities(drm: DrmLike[Int], numUsers: Int, maxInterestingItemsPerThing: Int,
-                          bcastNumInteractionsB: BCast[Vector], bcastNumInteractionsA: BCast[Vector],
-                          crossCooccurrence: Boolean = true) = {
+  def computeSimilarities(
+    drm: DrmLike[Int],
+    numUsers: Int,
+    maxInterestingItemsPerThing: Int,
+    bcastNumInteractionsB: BCast[Vector],
+    bcastNumInteractionsA: BCast[Vector],
+    crossCooccurrence: Boolean = true,
+    minLLROpt: Option[Double] = None) = {
+
+    val minLLR = minLLROpt.getOrElse(0.0d) // accept all values if not specified
+
     drm.mapBlock() {
       case (keys, block) =>
 
@@ -245,11 +351,13 @@ object SimilarityAnalysis extends Serializable {
               // val candidate = thingA -> normailizedLLR
 
               // Enqueue item with score, if belonging to the top-k
-              if (topItemsPerThing.size < maxInterestingItemsPerThing) {
-                topItemsPerThing.enqueue(candidate)
-              } else if (orderByScore.lt(candidate, topItemsPerThing.head)) {
-                topItemsPerThing.dequeue()
-                topItemsPerThing.enqueue(candidate)
+              if(candidate._2 >= minLLR) { // llr threshold takes precedence over max per row
+                if (topItemsPerThing.size < maxInterestingItemsPerThing) {
+                  topItemsPerThing.enqueue(candidate)
+                } else if (orderByScore.lt(candidate, topItemsPerThing.head)) {
+                  topItemsPerThing.dequeue()
+                  topItemsPerThing.enqueue(candidate)
+                }
               }
             }
           }
@@ -270,6 +378,7 @@ object SimilarityAnalysis extends Serializable {
    * https://github.com/tdunning/in-memory-cooccurrence/blob/master/src/main/java/com/tdunning/cooc/Analyze.java
    *
    * additionally binarizes input matrix, as we're only interesting in knowing whether interactions happened or not
+
*
    * @param drmM matrix to downsample
    * @param seed random number generator seed, keep to a constant if repeatability is neccessary
    * @param maxNumInteractions number of elements in a row of the returned matrix
@@ -325,3 +434,18 @@ object SimilarityAnalysis extends Serializable {
     downSampledDrmI
   }
 }
+
+case class ParOpts( // this will contain the default `par` params except for auto = true
+  minPar: Int = -1,
+  exactPar: Int = -1,
+  autoPar: Boolean = true)
+
+/* Used to pass in data and params for downsampling the input data as well as output A'A, A'B, etc. */
+case class DownsamplableCrossOccurrenceDataset(
+  iD: IndexedDataset,
+  maxElementsPerRow: Int = 500, // usually items per user in the input dataset, used to ramdomly downsample
+  maxInterestingElements: Int = 50, // number of items/columns to keep in the A'A, A'B etc. where iD == A, B, C ...
+  minLLROpt: Option[Double] = None, // absolute threshold, takes precedence over maxInterestingElements if present
+  parOpts: Option[ParOpts] = None) // these can be set per dataset and are applied to each of the drms
+                                   // in crossOccurrenceDownsampled
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/b5fe4aab/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
----------------------------------------------------------------------
diff --git a/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala b/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
index 0b3b3eb..63e0df7 100644
--- a/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
+++ b/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
@@ -17,9 +17,11 @@
 
 package org.apache.mahout.cf
 
-import org.apache.mahout.math.cf.SimilarityAnalysis
+import org.apache.mahout.math.cf.{DownsamplableCrossOccurrenceDataset, SimilarityAnalysis}
 import org.apache.mahout.math.drm._
+import org.apache.mahout.math.indexeddataset.BiDictionary
 import
org.apache.mahout.math.scalabindings.{MatrixOps, _}
+import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
 import org.apache.mahout.sparkbindings.test.DistributedSparkSuite
 import org.apache.mahout.test.MahoutSuite
 import org.scalatest.FunSuite
@@ -58,7 +60,7 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
     (1.7260924347106847, 0.6795961471815897, 0.6795961471815897, 1.7260924347106847, 0.0),
     (0.0, 0.0, 0.0, 0.0, 4.498681156950466))
 
-  final val matrixLLRCoocBtAControl = dense(
+  final val matrixLLRCoocAtBControl = dense(
     (1.7260924347106847, 1.7260924347106847, 1.7260924347106847, 1.7260924347106847, 0.0),
     (0.6795961471815897, 0.6795961471815897, 0.6795961471815897, 0.6795961471815897, 0.0),
     (0.6795961471815897, 0.6795961471815897, 0.6795961471815897, 0.6795961471815897, 0.0),
@@ -66,7 +68,7 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
     (0.0, 0.0, 0.6795961471815897, 0.0, 4.498681156950466))
 
 
-  test("cooccurrence [A'A], [B'A] boolbean data using LLR") {
+  test("Cross-occurrence [A'A], [B'A] boolbean data using LLR") {
     val a = dense(
       (1, 1, 0, 0, 0),
       (0, 0, 1, 1, 0),
@@ -91,13 +93,13 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
 
     //cross similarity
     val matrixCrossCooc = drmCooc(1).checkpoint().collect
-    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtAControl)
+    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBControl)
     n = (new MatrixOps(m = diff2Matrix)).norm
     n should be < 1E-10
 
   }
 
-  test("cooccurrence [A'A], [B'A] double data using LLR") {
+  test("Cross-occurrence [A'A], [B'A] double data using LLR") {
     val a = dense(
       (100000.0D, 1.0D, 0.0D, 0.0D, 0.0D),
       ( 0.0D, 0.0D, 10.0D, 1.0D, 0.0D),
@@ -122,12 +124,12 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
 
     //cross similarity
     val matrixCrossCooc = drmCooc(1).checkpoint().collect
-    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtAControl)
+    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBControl)
     n = (new MatrixOps(m = diff2Matrix)).norm
     n should be < 1E-10
   }
 
-  test("cooccurrence [A'A], [B'A] integer data using LLR") {
+  test("Cross-occurrence [A'A], [B'A] integer data using LLR") {
     val a = dense(
       ( 1000, 10, 0, 0, 0),
       ( 0, 0, -10000, 10, 0),
@@ -154,12 +156,12 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
 
     //cross similarity
     val matrixCrossCooc = drmCooc(1).checkpoint().collect
-    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtAControl)
+    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBControl)
     n = (new MatrixOps(m = diff2Matrix)).norm
     n should be < 1E-10
   }
 
-  test("cooccurrence two matrices with different number of columns"){
+  test("Cross-occurrence two matrices with different number of columns"){
     val a = dense(
       (1, 1, 0, 0, 0),
       (0, 0, 1, 1, 0),
@@ -172,7 +174,7 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
       (0, 0, 1, 0),
       (1, 1, 0, 1))
 
-    val matrixLLRCoocBtANonSymmetric = dense(
+    val matrixLLRCoocAtBNonSymmetric = dense(
       (0.0, 1.7260924347106847, 1.7260924347106847, 1.7260924347106847),
       (0.0, 0.6795961471815897, 0.6795961471815897, 0.0),
       (1.7260924347106847, 0.6795961471815897, 0.6795961471815897, 0.0),
@@ -191,7 +193,7 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
 
     //cross similarity
     val matrixCrossCooc = drmCooc(1).checkpoint().collect
-    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtANonSymmetric)
+    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBNonSymmetric)
     n = (new MatrixOps(m = diff2Matrix)).norm
 
     //cooccurrence without LLR is just a A'B
@@ -199,6 +201,107 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
     //val bp = 0
   }
 
+  test("Cross-occurrence two IndexedDatasets"){
+    val a = dense(
+      (1, 1, 0, 0, 0),
+      (0, 0, 1, 1, 0),
+      (0, 0, 0, 0, 1),
+      (1, 0, 0, 1, 0))
+
+    val b = dense(
+      (0, 1, 1, 0),
+      (1, 1, 1, 0),
+      (0, 0, 1, 0),
+      (1, 1, 0, 1))
+
+    val users = Seq("u1", "u2", "u3", "u4")
+    val itemsA = Seq("a1", "a2", "a3", "a4", "a5")
+    val itemsB = Seq("b1", "b2", "b3", "b4")
+    val userDict = new BiDictionary(users)
+    val itemsADict = new BiDictionary(itemsA)
+    val itemsBDict = new BiDictionary(itemsB)
+
+    // this is downsampled to the top 2 values per row to match the calc
+    val matrixLLRCoocAtBNonSymmetric = dense(
+      (0.0, 1.7260924347106847, 1.7260924347106847, 0.0),
+      (0.0, 0.6795961471815897, 0.6795961471815897, 0.0),
+      (1.7260924347106847, 0.6795961471815897, 0.0, 0.0),
+      (5.545177444479561, 1.7260924347106847, 0.0, 0.0),
+      (0.0, 0.0, 0.6795961471815897, 0.0))
+
+    val drmA = drmParallelize(m = a, numPartitions = 2)
+    val drmB = drmParallelize(m = b, numPartitions = 2)
+
+    val aID = new IndexedDatasetSpark(drmA, userDict, itemsADict)
+    val bID = new IndexedDatasetSpark(drmB, userDict, itemsBDict)
+    val aD = DownsamplableCrossOccurrenceDataset(aID)
+    val bD = DownsamplableCrossOccurrenceDataset(bID, maxInterestingElements = 2)
+
+    //self similarity
+    val drmCooc = SimilarityAnalysis.crossOccurrenceDownsampled(List(aD, bD))
+    val matrixSelfCooc = drmCooc(0).matrix.checkpoint().collect
+    val diffMatrix = matrixSelfCooc.minus(matrixLLRCoocAtAControl)
+    var n = (new MatrixOps(m = diffMatrix)).norm
+    n should be < 1E-10
+
+    //cross similarity
+    val matrixCrossCooc = drmCooc(1).matrix.checkpoint().collect
+    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBNonSymmetric)
+    n = (new MatrixOps(m = diff2Matrix)).norm
+    n should be < 1E-10
+  }
+
+  test("Cross-occurrence two IndexedDatasets LLR threshold"){
+    val a = dense(
+      (1, 1, 0, 0, 0),
+      (0, 0, 1, 1, 0),
+      (0, 0, 0, 0, 1),
+      (1, 0, 0, 1, 0))
+
+    val b = dense(
+      (0, 1, 1, 0),
+      (1, 1, 1, 0),
+      (0, 0, 1, 0),
+      (1, 1, 0, 1))
+
+    val users = Seq("u1", "u2", "u3", "u4")
+    val itemsA = Seq("a1", "a2", "a3", "a4", "a5")
+    val itemsB = Seq("b1", "b2", "b3", "b4")
+    val userDict = new BiDictionary(users)
+    val itemsADict =
new BiDictionary(itemsA)
+    val itemsBDict = new BiDictionary(itemsB)
+
+    // this is downsampled to the top 2 values per row to match the calc but also uses a min llr threshold so
+    // the # per row is still applied but nothing gets past the min llr check
+    val matrixLLRCoocAtBNonSymmetric = dense(
+      (0.0, 1.7260924347106847, 1.7260924347106847, 0.0),
+      (0.0, 0.0, 0.0, 0.0),
+      (1.7260924347106847, 0.0, 0.0, 0.0),
+      (5.545177444479561, 1.7260924347106847, 0.0, 0.0),
+      (0.0, 0.0, 0.0, 0.0))
+
+    val drmA = drmParallelize(m = a, numPartitions = 2)
+    val drmB = drmParallelize(m = b, numPartitions = 2)
+
+    val aID = new IndexedDatasetSpark(drmA, userDict, itemsADict)
+    val bID = new IndexedDatasetSpark(drmB, userDict, itemsBDict)
+    val aD = DownsamplableCrossOccurrenceDataset(aID)
+    val bD = DownsamplableCrossOccurrenceDataset(bID, minLLROpt = Some(1.7), maxInterestingElements = 2)
+
+    //self similarity
+    val drmCooc = SimilarityAnalysis.crossOccurrenceDownsampled(List(aD, bD))
+    val matrixSelfCooc = drmCooc(0).matrix.checkpoint().collect
+    val diffMatrix = matrixSelfCooc.minus(matrixLLRCoocAtAControl)
+    var n = (new MatrixOps(m = diffMatrix)).norm
+    n should be < 1E-10
+
+    //cross similarity
+    val matrixCrossCooc = drmCooc(1).matrix.checkpoint().collect
+    val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBNonSymmetric)
+    n = (new MatrixOps(m = diff2Matrix)).norm
+    n should be < 1E-10
+  }
+
   test("LLR calc") {
     val A = dense(
       (1, 1, 0, 0, 0),
