[ https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pat Ferrel updated MAHOUT-1883:
-------------------------------
Sprint: Jan/Feb-2016
> Create a type of IndexedDataset that filters unneeded data for CCO
> ------------------------------------------------------------------
>
> Key: MAHOUT-1883
> URL: https://issues.apache.org/jira/browse/MAHOUT-1883
> Project: Mahout
> Issue Type: Bug
> Components: Collaborative Filtering
> Affects Versions: 0.13.0
> Reporter: Pat Ferrel
> Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> The collaborative filtering CCO algorithm uses a DRM for each "indicator"
> type. All inputs must share the same set of user-ids, so every input matrix
> must have the same number of rows.
> In the past we have padded the row-id dictionary to include rows that appear
> only in secondary matrices. This can lead to very large amounts of data being
> processed in the CCO pipeline without affecting the results. Put another way,
> if a row doesn't exist in the primary matrix, it contributes no
> cross-occurrence to the other calculated cooccurrence matrices.
> If we are calculating P'P and P'S, S does not need rows that don't exist in P,
> so this Jira is to create an IndexedDataset companion object that takes an
> RDD[(String, String)] of interactions but uses the dictionary from P for
> row-ids and filters out all data that doesn't correspond to P. The companion
> object will create the row-id dictionary if one is not passed in, and use it
> to filter if one is passed in.
> We have seen data that can be reduced by many orders of magnitude using this
> technique. This could be handled outside of Mahout, but it always produces
> better performance, so this version of data prep seems worth including.
> It does not affect the CLI version yet but could be included there in a
> future Jira.
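
The filtering described in the issue can be illustrated with a minimal sketch. It uses plain Python collections rather than Spark RDDs or Mahout's actual IndexedDataset API, and the function name `build_indexed`, its parameters, and the sample data are all hypothetical. When no row-id dictionary is passed in, one is built from the data; when one is passed in, it is reused as-is and interactions for unknown users are dropped.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Interaction = Tuple[str, str]  # (user-id, item-id); stands in for RDD[(String, String)]


def build_indexed(
    interactions: Iterable[Interaction],
    row_ids: Optional[Dict[str, int]] = None,
) -> Tuple[Dict[str, int], Dict[str, int], List[Tuple[int, int]]]:
    """Hypothetical helper: return (row_ids, col_ids, cells).

    cells is the sorted list of (row index, column index) pairs with an interaction.
    """
    interactions = list(interactions)
    if row_ids is None:
        # Primary case: build the row-id dictionary from the data itself.
        row_ids = {}
        for user, _ in interactions:
            row_ids.setdefault(user, len(row_ids))
        kept = interactions
    else:
        # Secondary case: reuse the primary dictionary and drop unknown users.
        kept = [(user, item) for user, item in interactions if user in row_ids]

    # Column ids are assigned only from the interactions that survive filtering.
    col_ids: Dict[str, int] = {}
    for _, item in kept:
        col_ids.setdefault(item, len(col_ids))

    cells = sorted({(row_ids[user], col_ids[item]) for user, item in kept})
    return row_ids, col_ids, cells


# Made-up sample data: u3 appears only in the secondary input.
primary = [("u1", "buy-a"), ("u2", "buy-b")]
secondary = [("u1", "view-x"), ("u3", "view-y"), ("u2", "view-x")]

p_rows, p_cols, p_cells = build_indexed(primary)
s_rows, s_cols, s_cells = build_indexed(secondary, row_ids=p_rows)
```

In this sketch the secondary dataset shares the primary's row dictionary, so both matrices have the same rows, and the interaction from `u3` never enters the pipeline because it could not contribute to P'S.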