[ https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565852#comment-15565852 ]
Hudson commented on MAHOUT-1883:
--------------------------------
SUCCESS: Integrated in Jenkins build Mahout-Quality #3398 (See
[https://builds.apache.org/job/Mahout-Quality/3398/])
MAHOUT-1883 closes no PR, adds dataset filtering for minimal needed to (pat:
rev 1f5e36f249aabc68495ec15f64f5ed6754d9f1e2)
* (edit) mr/pom.xml
* (edit) distribution/pom.xml
* (edit) spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
* (edit) hdfs/pom.xml
* (edit) flink/pom.xml
* (edit) math/pom.xml
* (edit) examples/pom.xml
* (edit) h2o/pom.xml
* (edit) spark/pom.xml
* (edit) pom.xml
* (edit) spark-shell/pom.xml
* (edit) spark/src/main/scala/org/apache/mahout/sparkbindings/indexeddataset/IndexedDatasetSpark.scala
* (edit) buildtools/pom.xml
* (edit) math-scala/pom.xml
* (edit) integration/pom.xml
> Create a type of IndexedDataset that filters unneeded data for CCO
> ------------------------------------------------------------------
>
> Key: MAHOUT-1883
> URL: https://issues.apache.org/jira/browse/MAHOUT-1883
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Affects Versions: 0.13.0
> Reporter: Pat Ferrel
> Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> The collaborative filtering CCO algorithm uses a DRM for each "indicator"
> type. All inputs must share the same set of user-ids, so the row
> cardinality of every input matrix must be the same.
> In the past we have padded the row-id dictionary to include new rows that
> appear only in secondary matrices. This can lead to very large amounts of
> data being processed in the CCO pipeline without affecting the results.
> Put another way, if a row doesn't exist in the primary matrix, it
> contributes no cross-occurrences to the calculated cross-occurrence matrix.
> If we are calculating P'P and P'S, S does not need rows that don't exist
> in P. This Jira is therefore to create an IndexedDataset companion object
> that takes an RDD[(String, String)] of interactions, uses the dictionary
> from P for row-ids, and filters out all data that doesn't correspond to P.
> The companion object will create the row-ids dictionary if it is not
> passed in, and use it to filter if it is passed in.
> We have seen data that can be reduced by many orders of magnitude using this
> technique. This could be handled outside of Mahout, but it always produces
> better performance, so this version of data prep seems worth including.
> It does not affect the CLI version yet but could be included there in a
> future Jira.
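
The create-or-filter behaviour described in the issue can be illustrated with a minimal sketch. It is written in plain Python with in-memory lists, not against Mahout's Scala API or Spark RDDs, and `build_indexed_dataset`, its parameters, and the sample interactions are all hypothetical names chosen for illustration. It assumes each interaction is a (user-id, item-id) pair.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Interaction = Tuple[str, str]  # assumed layout: (user-id, item-id)


def build_indexed_dataset(
    interactions: Iterable[Interaction],
    row_ids: Optional[Dict[str, int]] = None,
) -> Tuple[Dict[str, int], Dict[str, int], Dict[int, Set[int]]]:
    """Hypothetical helper, not Mahout's API.

    Returns (row-id dictionary, column-id dictionary, sparse rows).
    If row_ids is None, the dictionary is created from the data (primary case).
    If row_ids is given, interactions for unknown users are dropped (secondary case).
    """
    pairs = list(interactions)
    if row_ids is None:
        # Primary matrix P: assign row indices in order of first appearance.
        row_ids = {}
        for user, _ in pairs:
            row_ids.setdefault(user, len(row_ids))
    else:
        # Secondary matrix S: keep only rows that already exist in P.
        pairs = [(user, item) for user, item in pairs if user in row_ids]

    # Column indices come from the interactions that survived filtering.
    col_ids: Dict[str, int] = {}
    for _, item in pairs:
        col_ids.setdefault(item, len(col_ids))

    # Sparse row representation: row index -> set of column indices.
    rows: Dict[int, Set[int]] = {}
    for user, item in pairs:
        rows.setdefault(row_ids[user], set()).add(col_ids[item])
    return row_ids, col_ids, rows


# Illustrative data only.
primary = [("u1", "iphone"), ("u2", "ipad"), ("u1", "ipad")]
secondary = [("u1", "galaxy"), ("u3", "nexus"), ("u2", "galaxy"), ("u4", "surface")]

p_rows, p_cols, p = build_indexed_dataset(primary)
s_rows, s_cols, s = build_indexed_dataset(secondary, row_ids=p_rows)
```

In this sketch the interactions from `u3` and `u4` never enter S, so S shares P's row dictionary and no padding of the row-id dictionary is needed.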