Pat Ferrel created MAHOUT-1883:
----------------------------------
Summary: Create a type of IndexedDataset that filters unneeded
data for CCO
Key: MAHOUT-1883
URL: https://issues.apache.org/jira/browse/MAHOUT-1883
Project: Mahout
Issue Type: Bug
Components: Collaborative Filtering
Affects Versions: 0.13.0
Reporter: Pat Ferrel
Assignee: Pat Ferrel
Fix For: 0.13.0
The collaborative filtering CCO algo uses a DRM for each "indicator" type. All
inputs must share the same set of user-ids, so every input matrix must have the
same number of rows.
In the past we have padded the row-id dictionary to include new rows that
appear only in secondary matrices. This can lead to very large amounts of data
being processed in the CCO pipeline without affecting the results. Put another
way, if a row doesn't exist in the primary matrix, it contributes no
cross-occurrence to the other calculated cooccurrence matrices.
If we are calculating P'P and P'S, S does not need rows that don't exist in P.
This Jira is therefore to create an IndexedDataset companion object that takes
an RDD[(String, String)] of interactions but uses the dictionary from P for
row-ids and filters out all data that doesn't correspond to a row in P. The
companion object will create the row-ids dictionary if one is not passed in,
and use it to filter if one is passed in.
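
The create-or-filter behaviour described above can be sketched roughly as
follows. This is only an illustration of the idea, not the Mahout
implementation: plain Scala collections stand in for RDD[(String, String)], and
the names FilteredInteractions and buildRowDictionary are hypothetical.

```scala
// Sketch only: Seq[(String, String)] stands in for Spark's RDD[(String, String)].
// FilteredInteractions and buildRowDictionary are hypothetical names, not Mahout API.
object FilteredInteractions {
  type Interaction = (String, String) // (user-id, item-id)

  /** Assign a dense row index to each distinct user-id, in first-seen order. */
  def buildRowDictionary(interactions: Seq[Interaction]): Map[String, Int] =
    interactions.map(_._1).distinct.zipWithIndex.toMap

  /**
   * Returns the row-id dictionary together with the interactions it covers.
   * No dictionary passed in: build one from the data (the primary case, P).
   * Dictionary passed in: drop interactions whose user-id is not in it (the
   * secondary case, S).
   */
  def apply(
      interactions: Seq[Interaction],
      existingRowIds: Option[Map[String, Int]] = None
  ): (Map[String, Int], Seq[Interaction]) = existingRowIds match {
    case None =>
      (buildRowDictionary(interactions), interactions)
    case Some(rowIds) =>
      (rowIds, interactions.filter { case (user, _) => rowIds.contains(user) })
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val primary   = Seq("u1" -> "iphone", "u2" -> "ipad", "u1" -> "ipad")
    val secondary = Seq("u1" -> "page-a", "u3" -> "page-b", "u2" -> "page-c")

    // P creates the dictionary; S reuses it and loses the row for u3.
    val (rowIds, _) = FilteredInteractions(primary)
    val (_, s)      = FilteredInteractions(secondary, Some(rowIds))
    println(s)
  }
}
```

In this sketch the secondary input keeps only interactions for u1 and u2,
because u3 never appears in the primary input and so cannot contribute to P'S.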
We have seen data that can be reduced by many orders of magnitude using this
technique. This could be handled outside of Mahout, but the filtering always
produces better performance, so this version of data-prep seems worth including.
It does not affect the CLI version yet but could be included there in a future
Jira.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)