[jira] [Updated] (MAHOUT-1883) Create a type of IndexedDataset that filters unneeded data for CCO

Pat Ferrel (JIRA) Sat, 01 Oct 2016 14:24:56 -0700

     [ 
https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Pat Ferrel updated MAHOUT-1883:
-------------------------------
    Sprint: Jan/Feb-2016

> Create a type of IndexedDataset that filters unneeded data for CCO
> ------------------------------------------------------------------
>
>                 Key: MAHOUT-1883
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1883
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.13.0
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
>
>
> The collaborative filtering CCO algorithm uses a DRM for each "indicator"
> type. All inputs must share the same set of user-ids, so the number of rows
> must be the same for all input matrices.
> In the past we have padded the row-id dictionary to include new rows that
> appear only in secondary matrices. This can lead to very large amounts of
> data being processed in the CCO pipeline without affecting the results. Put
> another way, if a row doesn't exist in the primary matrix, it will produce
> no cross-occurrences in the other calculated cooccurrence matrix.
> If we are calculating P'P and P'S, S does not need rows that don't exist in
> P. This Jira is therefore to create an IndexedDataset companion object that
> takes an RDD[(String, String)] of interactions, uses the dictionary from P
> for row-ids, and filters out all data that doesn't correspond to P. The
> companion object will create the row-ids dictionary if it is not passed in,
> and use it to filter if it is passed in.
> We have seen data that can be reduced by many orders of magnitude using this
> technique. This could be handled outside of Mahout, but it always produces
> better performance, so this version of data-prep seems worth including.
> It does not affect the CLI version yet but could be included there in a
> future Jira.
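
The create-or-filter behaviour described above can be sketched roughly as
follows. This is only an illustration under stated assumptions: plain Scala
collections stand in for Spark RDDs and for Mahout's IndexedDataset, and the
names RowFilterSketch, Prepared and prepare are hypothetical, not part of the
Mahout API.

```scala
// Hypothetical sketch: plain Scala collections stand in for Spark RDDs and
// Mahout's IndexedDataset; none of these names are Mahout API.
object RowFilterSketch {
  type Interaction = (String, String) // (user-id, item-id)

  /** Row dictionary plus the interactions that survive filtering. */
  final case class Prepared(rowIds: Map[String, Int], kept: Seq[Interaction])

  /** Build the row dictionary when none is given; otherwise filter by it. */
  def prepare(
      interactions: Seq[Interaction],
      existingRowIds: Option[Map[String, Int]] = None): Prepared =
    existingRowIds match {
      case None =>
        // Primary matrix P: assign each distinct user-id a row index.
        val rowIds = interactions.map(_._1).distinct.zipWithIndex.toMap
        Prepared(rowIds, interactions)
      case Some(rowIds) =>
        // Secondary matrix S: drop interactions whose user is not in P.
        val kept = interactions.filter { case (user, _) => rowIds.contains(user) }
        Prepared(rowIds, kept)
    }

  def main(args: Array[String]): Unit = {
    val p = prepare(Seq("u1" -> "iphone", "u2" -> "ipad"))
    val s = prepare(
      Seq("u1" -> "nexus", "u3" -> "galaxy", "u4" -> "surface"),
      Some(p.rowIds))
    println(s.kept) // only u1's interaction remains
  }
}
```

Because S reuses P's dictionary, both datasets end up with the same row space,
and users u3 and u4, who never appear in P, are dropped before any
cross-occurrence calculation would see them.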



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
