[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391126#comment-15391126 ]

Pat Ferrel edited comment on MAHOUT-1853 at 8/4/16 4:15 PM:
------------------------------------------------------------

To reword this issue...

The CCO analysis code currently applies a single cap on the # of values kept 
per row to all of the P’X matrices. This has proven an insufficient threshold 
for many of the possible cross-occurrence types. For a user * item input 
matrix, which becomes an item * item output, a fixed # per row is fine, but 
the cap is a bit meaningless when the X matrix has only 20 columns. For 
instance, if X = C (category preferences), there may be only 20 possible 
categories. With a threshold of 100, and since users often have enough usage 
to trigger preference events on all categories (though resulting in a small 
LLR value), the P’C matrix ends up almost completely full. This reduces the 
usefulness of P’C.
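
A toy illustration of the problem, using made-up LLR scores and a hypothetical 
helper `top_k_per_row` (this is not Mahout code): when the per-row cap is 
larger than the number of columns, nothing gets filtered.

```python
def top_k_per_row(row, k):
    """Keep the k highest-scoring entries of a sparse row (dict: column -> LLR score)."""
    kept = sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(kept)

# Hypothetical row of P'C: one item scored against 20 categories, mostly weak scores.
row = {f"cat{i}": 0.1 * (i + 1) for i in range(20)}

# Cap of 100, shared with the much wider item * item matrix.
dense = top_k_per_row(row, k=100)
print(len(dense))  # 20: every category survives, weak ones included
```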

There are several ways to address this:
1) have a # of indicators per row threshold for every P'X matrix, instead of 
one for all (the current impl)
2) use a fixed LLR threshold value per matrix
3) use a confidence of correlation value (a % maybe) that is calculated from 
the data by looking at the distribution in P’C or other matrices. This is 
potentially O(n^2) where n = number of items in the matrix, but it may be 
practical to calculate for some types of data since n may be very small.

#1 and #2 are easy in the extreme; #3 can actually be calculated after the 
fact and used in #2 even if it is not included in Mahout.
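
A rough sketch of how #1 and #2 might combine per matrix, and how a threshold 
derived offline in the spirit of #3 could be fed into #2. The names 
`DownsampleParams`, `downsample_row`, and `quantile_threshold` are 
hypothetical, and using a plain quantile as the stand-in for "confidence" is 
an assumption for illustration only; none of this is Mahout's API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DownsampleParams:
    max_per_row: Optional[int] = None   # option 1: per-matrix cap on indicators per row
    min_llr: Optional[float] = None     # option 2: per-matrix fixed LLR threshold

def downsample_row(row, params):
    """Filter a sparse row (dict: column -> LLR score) using one matrix's params."""
    items = list(row.items())
    if params.min_llr is not None:
        items = [(c, s) for c, s in items if s >= params.min_llr]
    items.sort(key=lambda kv: kv[1], reverse=True)
    if params.max_per_row is not None:
        items = items[:params.max_per_row]
    return dict(items)

def quantile_threshold(scores, q):
    """Pick a cut-off from the observed scores (assumes at least one score)."""
    ordered = sorted(scores)
    idx = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[idx]

# Separate settings per matrix instead of one cap for all of them.
per_matrix = {
    "P'P": DownsampleParams(max_per_row=100),
    "P'C": DownsampleParams(max_per_row=2, min_llr=1.0),
}

row = {"a": 12.0, "b": 0.4, "c": 3.1, "d": 0.2}  # made-up LLR scores
print(downsample_row(row, per_matrix["P'C"]))  # {'a': 12.0, 'c': 3.1}

# Threshold computed after the fact from the data, then used as a fixed value.
cut = quantile_threshold(row.values(), 0.5)
derived = downsample_row(row, DownsampleParams(min_llr=cut))
```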

I've started work on #1 and #2.

[~ssc][~tdunning] I'm especially looking for comments on #3 above, calculating 
a % confidence of correlation. The function we use for LLR scoring is 
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala#L210
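
For anyone without the source handy, the standard log-likelihood ratio (G^2) 
on a 2x2 contingency table can be sketched as below. This is a Python 
paraphrase of the textbook formula, not a copy of the linked Scala function; 
the function and argument names are hypothetical.

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy: N*ln(N) - sum(k*ln(k)), where N = sum(counts)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """G^2 for a 2x2 table; clamped at 0 to absorb floating-point round-off."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def llr_from_counts(with_a, with_b, with_both, total):
    """Build the 2x2 table from interaction counts, then score it."""
    k11 = with_both                              # A and B together
    k12 = with_a - with_both                     # A without B
    k21 = with_b - with_both                     # B without A
    k22 = total - with_a - with_b + with_both    # neither
    return llr(k11, k12, k21, k22)
```

Independent events score near 0 and strongly associated ones score high, which 
is why a fixed or data-derived cut-off on this value can thin out a nearly 
full matrix.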


was (Author: pferrel):
To reword this issue...

The CCO analysis code currently only employs a single # of values per row of 
the P’? matrices. This has proven an insufficient threshold for many of the 
possible cross-occurrence types. The problem is that for a user * item input 
matrix, which becomes an item * item output a fixed # per row is fine but the 
implementation is a bit meaningless when there are only 20 columns of the ? 
matrix. For instance if ? = C category preferences, there may be only 20 
possible categories and with a threshold of 100 and the fact that users often 
have enough usage to trigger preference events on all categories (though 
resulting in a small LLR value), the P’C matrix is almost completely full. This 
reduces any value in P’C.

There are several ways to address:
1) have a # of indicators per row threshold for every matrix, not one for all 
(the current impl)
2) use a fixed LLR threshold value per matrix
3) use a confidence of correlation value (a % maybe) that is calculated from 
the data by looking at the distribution in P’C or other. This is potentially 
O(n^2) where n = number of items in the matrix. This may be practical to 
calculate for some types of data since n may be very small.

1 and 2 are easy in the extreme, #3 can actually be calculated after the fact 
and used in #2 even if it is not included in Mahout.

starting work on #1 and #2

> Improvements to CCO (Correlated Cross-Occurrence)
> -------------------------------------------------
>
>                 Key: MAHOUT-1853
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1853
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.12.0
>            Reporter: Andrew Palumbo
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
