[ 
https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565852#comment-15565852
 ] 

Hudson commented on MAHOUT-1883:
--------------------------------

SUCCESS: Integrated in Jenkins build Mahout-Quality #3398 (See 
[https://builds.apache.org/job/Mahout-Quality/3398/])
MAHOUT-1883 closes no PR, adds dataset filtering for minimal needed to (pat: 
rev 1f5e36f249aabc68495ec15f64f5ed6754d9f1e2)
* (edit) mr/pom.xml
* (edit) distribution/pom.xml
* (edit) spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
* (edit) hdfs/pom.xml
* (edit) flink/pom.xml
* (edit) math/pom.xml
* (edit) examples/pom.xml
* (edit) h2o/pom.xml
* (edit) spark/pom.xml
* (edit) pom.xml
* (edit) spark-shell/pom.xml
* (edit) spark/src/main/scala/org/apache/mahout/sparkbindings/indexeddataset/IndexedDatasetSpark.scala
* (edit) buildtools/pom.xml
* (edit) math-scala/pom.xml
* (edit) integration/pom.xml


> Create a type of IndexedDataset that filters unneeded data for CCO
> ------------------------------------------------------------------
>
>                 Key: MAHOUT-1883
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1883
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.13.0
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
>
>
> The collaborative filtering CCO algorithm uses DRMs for each "indicator"
> type. All inputs must share the same set of user-ids, so the row
> dimension of every input matrix must be the same.
> In the past we have padded the row-id dictionary to include rows that
> appear only in secondary matrices. This can lead to very large amounts of
> data being processed in the CCO pipeline without affecting the results.
> Put another way, if a row doesn't exist in the primary matrix, it will
> produce no cross-occurrence in the other calculated cooccurrence matrix.
> If we are calculating P'P and P'S, S will not need rows that don't exist
> in P. This Jira is therefore to create an IndexedDataset companion object
> that takes an RDD[(String, String)] of interactions, uses the dictionary
> from P for row-ids, and filters out all data that doesn't correspond to P.
> The companion object will create the row-id dictionary if one is not
> passed in, and use it to filter if one is passed in.
> We have seen data that can be reduced by many orders of magnitude using
> this technique. This could be handled outside of Mahout, but it always
> produces better performance, so this version of data prep seems worth
> including.
> It does not affect the CLI version yet but could be included there in a 
> future Jira.
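
The filtering described above can be sketched roughly as follows. This is
only an illustration under stated assumptions: plain Scala collections stand
in for RDD[(String, String)], and the names DictionaryFilterSketch,
buildRowDictionary and filterByRows are hypothetical, not the API added by
this commit. An RDD offers analogous map, distinct and filter operations,
so the same shape should carry over.

```scala
// Sketch only: Seq[(String, String)] stands in for RDD[(String, String)].
// Every name in this object is hypothetical and is not Mahout's API.
object DictionaryFilterSketch {
  // (row-id, column-id), for example (user-id, item-id)
  type Interaction = (String, String)

  /** Assign a dense integer index to each distinct row-id, in first-seen order. */
  def buildRowDictionary(interactions: Seq[Interaction]): Map[String, Int] =
    interactions.map(_._1).distinct.zipWithIndex.toMap

  /** Keep only the interactions whose row-id is already in the dictionary. */
  def filterByRows(
      interactions: Seq[Interaction],
      rowDict: Map[String, Int]): Seq[Interaction] =
    interactions.filter { case (rowId, _) => rowDict.contains(rowId) }

  def main(args: Array[String]): Unit = {
    // Primary interactions (P): no dictionary is passed in, so one is built.
    val primary = Seq("u1" -> "buy-a", "u2" -> "buy-b", "u1" -> "buy-b")
    val rowDict = buildRowDictionary(primary)

    // Secondary interactions (S): P's dictionary is passed in and used to filter.
    val secondary =
      Seq("u1" -> "view-x", "u3" -> "view-y", "u4" -> "view-x", "u2" -> "view-z")
    val filteredSecondary = filterByRows(secondary, rowDict)

    println(rowDict)
    println(filteredSecondary)
  }
}
```

In this toy data, u3 and u4 never appear in P, so their rows are dropped
from S before any P'S computation would see them.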



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
