Hi Pat,
We truncate the indicators to the top-k and you don't want the
self-comparison in there. So I don't see a reason not to exclude it as
early as possible.
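A minimal, hypothetical sketch (plain Scala, made-up scores, no Mahout types) of why the self-pair is best dropped before the top-k truncation: the self-comparison always scores highest, so if it is left in it wastes one of the k indicator slots.

```scala
// Hypothetical similarity scores for item "A" against the catalog;
// "A" -> 1.0 is the self-pair, which always has the highest score.
val itemA = "A"
val scores = Map("A" -> 1.0, "B" -> 0.9, "C" -> 0.4, "D" -> 0.1)

def topK(k: Int, excludeSelf: Boolean): List[String] =
  scores.toList
    .filter { case (item, _) => !excludeSelf || item != itemA } // drop the self-pair early
    .sortBy { case (_, score) => -score }                       // highest score first
    .take(k)                                                    // truncate to top-k
    .map(_._1)

// With the self-pair excluded, both slots go to real neighbors.
println(topK(2, excludeSelf = true))  // List(B, C)
// Left in, the self-pair consumes one of the two slots.
println(topK(2, excludeSelf = false)) // List(A, B)
```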
--sebastian
On 06/10/2014 05:28 PM, Pat Ferrel wrote:
Still getting the wrong values with non-boolean input, so I’ll continue to
look into it.
Another question: computeIndicators seems to exclude self-comparison during
A’A but, of course, not for B’A. Since this returns the indicator matrix for
the general case, shouldn’t it include those values? It seems like they should
be filtered out in the output phase, if anywhere, and then only as an option.
If we were actually returning a multiply we’d include those.
// exclude co-occurrences of the item with itself
if (crossCooccurrence || thingB != thingA) {
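For illustration, a hypothetical, self-contained sketch of where that guard sits (the loop structure and names here are invented, not the actual Mahout code): for A’A the diagonal pairs are skipped, while for B’A (crossCooccurrence = true) every pair is kept.

```scala
// Hypothetical co-occurrence counting over one user's interactions.
def countCooccurrences(thingsA: Seq[Int], thingsB: Seq[Int],
                       crossCooccurrence: Boolean): Map[(Int, Int), Int] = {
  val counts = scala.collection.mutable.Map.empty[(Int, Int), Int].withDefaultValue(0)
  for (thingA <- thingsA; thingB <- thingsB) {
    // exclude co-occurrences of the item with itself (only relevant for A'A)
    if (crossCooccurrence || thingB != thingA) {
      counts((thingA, thingB)) += 1
    }
  }
  counts.toMap
}

// A'A over items {1, 2}: the (1,1) and (2,2) self-pairs are dropped.
println(countCooccurrences(Seq(1, 2), Seq(1, 2), crossCooccurrence = false))
// B'A keeps all four pairs, including the matching indices.
println(countCooccurrences(Seq(1, 2), Seq(1, 2), crossCooccurrence = true))
```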
On Jun 10, 2014, at 1:49 AM, Sebastian Schelter <s...@apache.org> wrote:
Oh good catch! I had an extra binarize method before, so that the data was
already binary. I merged that into the downsample code and must have overlooked
that thing. You are right, numNonZeros is the way to go!
On 06/10/2014 01:11 AM, Ted Dunning wrote:
Sounds like a very plausible root cause.
On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA) <j...@apache.org> wrote:
[
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
]
Pat Ferrel commented on MAHOUT-1464:
------------------------------------
Seems like the downsampleAndBinarize method is returning the wrong values.
It is actually summing the values where it should be counting the non-zero
elements?
// Downsample the interaction vector of each user
for (userIndex <- 0 until keys.size) {
  val interactionsOfUser = block(userIndex, ::) // this is a Vector
  // if the values are non-boolean the sum will not be the number of
  // interactions, it will be a sum of strength-of-interaction, right?
  // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
  val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think
  val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser

  interactionsOfUser.nonZeroes().foreach { elem =>
    val numInteractionsWithThing = numInteractions(elem.index)
    val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing

    if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
      // We ignore the original interaction value and create a binary 0-1 matrix
      // as we only consider whether interactions happened or did not happen
      downsampledBlock(userIndex, elem.index) = 1
    }
  }
}
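A minimal, Mahout-free sketch of the difference Pat describes (hypothetical data, plain Scala collections): for non-boolean interaction strengths, summing the vector is not the same as counting its non-zero elements, so the per-user sampling rate comes out wrong.

```scala
// One user's row of strength-of-interaction values (hypothetical data).
val interactionsOfUser = Vector(0.0, 3.0, 0.0, 5.0, 1.0)

val sumOfStrengths = interactionsOfUser.sum              // 9.0 -- sums strengths
val numInteractions = interactionsOfUser.count(_ != 0.0) // 3   -- counts interactions

// The downsampling rate should be based on the count of interactions,
// not the summed strengths:
val maxNumInteractions = 2
val perUserSampleRate =
  math.min(maxNumInteractions, numInteractions).toDouble / numInteractions // 2.0 / 3
println(perUserSampleRate)
```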
Cooccurrence Analysis on Spark
------------------------------
Key: MAHOUT-1464
URL: https://issues.apache.org/jira/browse/MAHOUT-1464
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Environment: hadoop, spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel
Fix For: 1.0
Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
run-spark-xrsj.sh
Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
a DRM can be used as input.
Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
has several applications including cross-action recommendations.
--
This message was sent by Atlassian JIRA
(v6.2#6252)