[
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037704#comment-14037704
]
ASF GitHub Bot commented on MAHOUT-1541:
----------------------------------------
Github user dlyubimov commented on a diff in the pull request:
https://github.com/apache/mahout/pull/22#discussion_r13987451
--- Diff:
spark/src/main/scala/org/apache/mahout/cf/examples/Recommendations.scala ---
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.cf.examples
+
+import scala.io.Source
+import org.apache.mahout.math._
+import scalabindings._
+import RLikeOps._
+import drm._
+import RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+
+import org.apache.mahout.cf.CooccurrenceAnalysis._
+import scala.collection.JavaConversions._
+
+/**
+ * The Epinions dataset contains ratings from users for items and a trust network between the users.
+ * We use co-occurrence analysis to compute "users who like these items also like those items" and
+ * "users who trust these users like those items".
+ *
+ * Download and unpack the dataset files from:
+ *
+ * http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2
+ * http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2
+ **/
+object RunCrossCooccurrenceAnalysisOnEpinions {
+
+ def main(args: Array[String]): Unit = {
+
+ if (args.length == 0) {
+ println("Usage: RunCrossCooccurrenceAnalysisOnEpinions <path-to-dataset-folder>")
+ println("Download the dataset from http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2 and")
+ println("http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2")
+ sys.exit(-1)
+ }
+
+ val datasetDir = args(0)
+
+ val epinionsRatings = new SparseMatrix(49290, 139738)
+
+ var firstLineSkipped = false
+ for (line <- Source.fromFile(datasetDir + "/ratings_data.txt").getLines()) {
+ if (line.contains(' ') && firstLineSkipped) {
+ val tokens = line.split(' ')
+ val userID = tokens(0).toInt - 1
+ val itemID = tokens(1).toInt - 1
+ val rating = tokens(2).toDouble
+ epinionsRatings(userID, itemID) = rating
+ }
+ firstLineSkipped = true
+ }
+
+ val epinionsTrustNetwork = new SparseMatrix(49290, 49290)
+ firstLineSkipped = false
+ for (line <- Source.fromFile(datasetDir + "/trust_data.txt").getLines()) {
+ if (line.contains(' ') && firstLineSkipped) {
+ val tokens = line.trim.split(' ')
+ val userID = tokens(0).toInt - 1
+ val trustedUserId = tokens(1).toInt - 1
+ epinionsTrustNetwork(userID, trustedUserId) = 1
+ }
+ firstLineSkipped = true
+ }
+
+ System.setProperty("spark.kryo.referenceTracking", "false")
+ System.setProperty("spark.kryoserializer.buffer.mb", "100")
+ /* to run locally, the number of cores can be set by changing "local" to e.g. "local[4]" */
+ implicit val distributedContext = mahoutSparkContext(masterUrl = "local", appName = "MahoutLocalContext",
+ customJars = Traversable.empty[String])
--- End diff ---
Again, there is no need to specify customJars if no jars are added.
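A minimal sketch of the suggested simplification (this assumes the customJars parameter of mahoutSparkContext has an empty-collection default, so it can simply be omitted when no extra jars need to be shipped):

```scala
// Same context creation as in the diff, without the redundant customJars argument.
// Assumes mahoutSparkContext's customJars parameter defaults to an empty Traversable.
implicit val distributedContext = mahoutSparkContext(masterUrl = "local", appName = "MahoutLocalContext")
```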
> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
> Issue Type: Bug
> Components: CLI
> Reporter: Pat Ferrel
> Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an
> IndexedDataset with BiMap ID translation dictionaries, call the Spark
> CooccurrenceAnalysis with the appropriate params, then write output with
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will
> support reading externally defined IDs and flexible formats. Output will be
> of the legacy format or text files of the user's specification with
> reattached Item IDs.
> Support for legacy formats is a question, users can always use the legacy
> code if they want this. Internal to the IndexedDataset is a Spark DRM so
> pipelining can be accomplished without any writing to an actual file so the
> legacy sequence file output may not be needed.
> Opinions?
--
This message was sent by Atlassian JIRA
(v6.2#6252)