[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254341#comment-15254341 ]

ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

    https://github.com/apache/mahout/pull/224#issuecomment-213532181
  
    @resec As far as updating the "In Core Reference Page": if it's a short
    addition, it may be easiest to just add the text to your JIRA (or here on
    the PR); that way one of us can just make the addition.

    This looks good to me. +1 to commit this.



> Enhance svec function to accept cardinality as parameter 
> ---------------------------------------------------------
>
>                 Key: MAHOUT-1833
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1833
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>         Environment: Mahout Spark Shell 0.12.0,
> Spark 1.6.0 Cluster on Hadoop Yarn 2.7.1, 
> Centos7 64bit
>            Reporter: Edmond Luo
>
> It will be nice to enhance the existing svec function in 
> org.apache.mahout.math.scalabindings
> {code}
>   /**
>    * Create a sparse vector out of a list of tuple2's.
>    * @param sdata list of (index, value) pairs
>    * @param cardinality optional vector cardinality; when negative (the
>    *                    default) it is inferred as the maximum index + 1
>    * @return a RandomAccessSparseVector of the given (or inferred) cardinality
>    */
>   def svec(sdata: TraversableOnce[(Int, AnyVal)], cardinality: Int = -1) = {
>     // A TraversableOnce may only be traversed once, so materialize it first.
>     val data = sdata.toSeq
>     val required = if (data.nonEmpty) data.map(_._1).max + 1 else 0
>     if (cardinality >= 0 && cardinality < required)
>       throw new IllegalArgumentException(
>         s"Required cardinality $required but got $cardinality")
>     val size = if (cardinality < 0) required else cardinality
>     val sv = new RandomAccessSparseVector(size, data.size)
>     data.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
>     sv
>   }
> {code}
> So the user can specify the cardinality of the created sparse vector.
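> For example, a minimal sketch of the intended behaviour (hypothetical values,
> assuming the enhanced svec above is in scope):
> {code}
> val v1 = svec(Seq(3 -> 1.0))     // cardinality inferred as max index + 1 = 4
> val v2 = svec(Seq(3 -> 1.0), 20) // cardinality fixed at 20
> // svec(Seq(3 -> 1.0), 2)        // would throw IllegalArgumentException: 2 < 4
> {code}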
> This is very useful and convenient if the user wants to create a DRM from many
> sparse vectors that do not have the same actual size but do share the same
> logical size (e.g. rows of a sparse matrix).
> The code below demonstrates the case:
> {code}
> val cardinality = 20
> val rdd = sc.textFile("/some/file.txt")
>   .map(_.split(","))
>   .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
>   .reduceByKey((v1, v2) => v1 ++ v2)
>   .map(row => (row._1, svec(row._2, cardinality)))
> val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))
> // All element-wise operations below would fail if the DRM's sparse vectors
> // did not share a consistent cardinality.
> val drm2 = drm + drm.t
> val drm3 = drm - drm.t
> val drm4 = drm * drm.t
> val drm5 = drm / drm.t
> {code}
> Notice that in the last map, svec accepted an additional cardinality
> parameter, so the cardinality of the created sparse vectors is consistent.
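> As a minimal illustration of that point (hypothetical indices, assuming the
> enhanced svec above):
> {code}
> // Without an explicit cardinality, each vector's size depends on its own max index:
> svec(Seq(3 -> 1.0)).size       // 4
> svec(Seq(15 -> 1.0)).size      // 16
> // With the cardinality parameter, all rows share the same logical size:
> svec(Seq(3 -> 1.0), 20).size   // 20
> svec(Seq(15 -> 1.0), 20).size  // 20
> {code}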



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
