[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391126#comment-15391126 ]
Pat Ferrel edited comment on MAHOUT-1853 at 8/4/16 4:15 PM:
------------------------------------------------------------

To reword this issue...

The CCO analysis code currently employs only a single # of values per row, shared by all of the P’X matrices. This has proven an insufficient threshold for many of the possible cross-occurrence types.

The problem is that for a user * item input matrix, which becomes an item * item output, a fixed # per row is fine, but the implementation is a bit meaningless when there are only 20 columns of the X matrix. For instance, if X = C (category preferences), there may be only 20 possible categories. With a threshold of 100, and given that users often have enough usage to trigger preference events on all categories (though resulting in a small LLR value), the P’C matrix is almost completely full. This reduces any value in P’C.

There are several ways to address this:
1) have a # of indicators per row threshold for every P’X matrix, not one for all (the current impl)
2) use a fixed LLR threshold value per matrix
3) use a confidence of correlation value (a % maybe) that is calculated from the data by looking at the distribution in P’C or other. This is potentially O(n^2) where n = number of items in the matrix. This may be practical to calculate for some types of data since n may be very small.

#1 and #2 are easy in the extreme; #3 can actually be calculated after the fact and used in #2 even if it is not included in Mahout. I've started work on #1 and #2.

[~ssc][~tdunning] I'm especially looking for comments on #3 above, calculating a % confidence of correlation. The function we use for LLR scoring is:
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala#L210


was (Author: pferrel):
To reword this issue...

The CCO analysis code currently only employs a single # of values per row of the P’? matrices. This has proven an insufficient threshold for many of the possible cross-occurrence types.
The problem is that for a user * item input matrix, which becomes an item * item output a fixed # per row is fine but the implementation is a bit meaningless when there are only 20 columns of the ? matrix. For instance if ? = C category preferences, there may be only 20 possible categories and with a threshold of 100 and the fact that users often have enough usage to trigger preference events on all categories (though resulting in a small LLR value), the P’C matrix is almost completely full. This reduces any value in P’C.

There are several ways to address:
1) have a # of indicators per row threshold for every matrix, not one for all (the current impl)
2) use a fixed LLR threshold value per matrix
3) use a confidence of correlation value (a % maybe) that is calculated from the data by looking at the distribution in P’C or other. This is potentially O(n^2) where n = number of items in the matrix. This may be practical to calculate for some types of data since n may be very small.

1 and 2 are easy in the extreme, #3 can actually be calculated after the fact and used in #2 even if it is not included in Mahout.

starting work on #1 and #2


> Improvements to CCO (Correlated Cross-Occurrence)
> -------------------------------------------------
>
>                 Key: MAHOUT-1853
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1853
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.12.0
>            Reporter: Andrew Palumbo
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold
> calculation for LLR downsampling, and possible multiple fixed thresholds for
> A’A, A’B etc. This is to account for the vast difference in dimensionality
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
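
For readers following the three options in the comment, here is a minimal sketch in plain Python. It is not Mahout code and not the SimilarityAnalysis API. All function names, the category labels and the scores are hypothetical. `llr` follows the usual entropy form of the log-likelihood ratio for a 2x2 contingency table, `downsample_row` illustrates options 1 and 2 (a per-matrix row limit and a fixed LLR floor), and `empirical_threshold` is only one possible reading of option 3: pick the cut-off from the observed score distribution.

```python
import math


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def downsample_row(scores, max_per_row=None, min_llr=None):
    """Filter one row, given as {column id: LLR score}.

    max_per_row: keep at most this many highest-scoring entries (option 1).
    min_llr:     drop entries scoring below this value (option 2).
    """
    kept = [(c, s) for c, s in scores.items() if min_llr is None or s >= min_llr]
    kept.sort(key=lambda cs: cs[1], reverse=True)
    if max_per_row is not None:
        kept = kept[:max_per_row]
    return dict(kept)


def empirical_threshold(all_scores, fraction_kept):
    """LLR cut-off that keeps roughly the top fraction of observed scores.

    Hypothetical reading of option 3; the comment leaves the method open.
    """
    ranked = sorted(all_scores, reverse=True)
    n_keep = max(1, int(len(ranked) * fraction_kept))
    return ranked[n_keep - 1]


# Made-up row of a P'C-style matrix with only four category columns.
row = {"electronics": 27.7, "books": 0.4, "garden": 3.1, "toys": 0.0}

# A row limit sized for item * item output filters nothing here.
full = downsample_row(row, max_per_row=100)

# A limit chosen for this matrix, or an LLR floor, removes the weak entries.
top2 = downsample_row(row, max_per_row=2)
floored = downsample_row(row, min_llr=3.0)

# A cut-off derived from the data can then be reused as the fixed floor.
cutoff = empirical_threshold(row.values(), 0.5)
```

With a row limit of 100 and only four columns, every entry survives, including the zero-scoring one, which is the "almost completely full" situation the comment describes. A cut-off derived after the fact, as in `empirical_threshold`, can be passed back in as `min_llr`, which mirrors the remark that #3 can be calculated separately and used in #2.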