[ 
https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edmond Luo updated MAHOUT-1833:
-------------------------------
    Description: 
It would be nice to enhance the existing svec function in
org.apache.mahout.math.scalabindings:

{code}
  /**
   * Create a sparse vector out of a list of tuple2's.
   *
   * @param sdata       (index, value) pairs to populate the vector with
   * @param cardinality the vector's cardinality; if negative (the default),
   *                    it is inferred from the largest index in sdata
   * @return a RandomAccessSparseVector of the given (or inferred) cardinality
   */
  def svec(sdata: TraversableOnce[(Int, AnyVal)], cardinality: Int = -1) = {
    // Materialize first: a TraversableOnce cannot safely be traversed twice.
    val data = sdata.toSeq
    val required = if (data.nonEmpty) data.map(_._1).max + 1 else 0
    val n =
      if (cardinality < 0) required
      else if (cardinality < required)
        throw new IllegalArgumentException(
          s"Required cardinality $required but got $cardinality")
      else cardinality
    val initialCapacity = data.size
    val sv = new RandomAccessSparseVector(n, initialCapacity)
    data.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
    sv
  }
{code}

This way the user can specify the cardinality of the created sparse vector.
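
For illustration, a minimal usage sketch of the proposed signature (the index/value pairs are made up):

{code}
// Cardinality inferred from the largest index: a vector of size 6.
val a = svec(List((0, 1.0), (5, 2.0)))

// Explicit cardinality: a vector of size 20 with the same two nonzeros.
val b = svec(List((0, 1.0), (5, 2.0)), cardinality = 20)

// Throws IllegalArgumentException: required cardinality is 6, but 5 was given.
// val c = svec(List((0, 1.0), (5, 2.0)), cardinality = 5)
{code}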

This is very useful and convenient when the user wants to create a DRM from
many sparse vectors that do not share the same actual size but do share the
same logical size (e.g. rows of a sparse matrix).

The code below demonstrates the use case:
{code}
val cardinality = 20

// Build (rowKey, Array[(columnIndex, value)]) pairs, then turn each row
// into a sparse vector with an explicit, consistent cardinality.
val rdd = sc.textFile("/some/file.txt")
  .map(_.split(","))
  .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
  .reduceByKey((v1, v2) => v1 ++ v2)
  .map(row => (row._1, svec(row._2, cardinality)))

val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

// All element-wise operations below fail if the DRM's rows are sparse
// vectors with inconsistent cardinalities.
val drm2 = drm + drm.t
val drm3 = drm - drm.t
val drm4 = drm * drm.t
val drm5 = drm / drm.t
{code}

Notice that in the last map, svec accepts an additional cardinality parameter,
so the cardinality of the created sparse vectors is consistent.
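
The failure mode is easy to see in-core as well (a sketch, assuming the in-core scalabindings operators; Mahout math reports mismatched vector sizes with a CardinalityException):

{code}
// Without an explicit cardinality, each vector's size is inferred from
// its own largest index, so logically-equal rows can differ in size.
val x = svec(List((0, 1.0), (2, 3.0)))            // inferred cardinality 3
val y = svec(List((0, 1.0)), cardinality = 20)    // explicit cardinality 20

// Mixing them in element-wise math fails with a size mismatch:
// val z = x + y   // throws org.apache.mahout.math.CardinalityException
{code}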


  was:
It would be nice to enhance the existing svec function in
org.apache.mahout.math.scalabindings:

{code}
/**
 * Create a sparse vector out of a list of tuple2's with a specific
 * cardinality (size); throws IllegalArgumentException if cardinality is
 * smaller than the required cardinality of sdata.
 *
 * @param cardinality the vector's cardinality
 * @param sdata       (index, value) pairs to populate the vector with
 * @return a RandomAccessSparseVector of the given cardinality
 */
def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
  val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
  if (cardinality < required) {
    throw new IllegalArgumentException(
      s"Cardinality[$cardinality] must be at least required[$required]!")
  }

  val initialCapacity = sdata.size
  val sv = new RandomAccessSparseVector(cardinality, initialCapacity)
  sdata.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
  sv
}
{code}

This way the user can specify the cardinality of the created sparse vector.
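
For illustration, a minimal usage sketch of this earlier signature (the index/value pairs are made up):

{code}
// A vector of cardinality 20 with nonzeros at indices 0 and 5.
val v = svec(20, List((0, 1.0), (5, 2.0)))

// Throws IllegalArgumentException: required cardinality is 6, but 5 was given.
// val w = svec(5, List((0, 1.0), (5, 2.0)))
{code}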

This is very useful and convenient when the user wants to create a DRM from
many sparse vectors that do not share the same actual size but do share the
same logical size (e.g. rows of a sparse matrix).

The code below demonstrates the use case:
{code}
val cardinality = 20

// Note: with this signature the cardinality is the first argument.
val rdd = sc.textFile("/some/file.txt")
  .map(_.split(","))
  .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
  .reduceByKey((v1, v2) => v1 ++ v2)
  .map(row => (row._1, svec(cardinality, row._2)))

val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

// All element-wise operations below fail if the DRM's rows are sparse
// vectors with inconsistent cardinalities.
val drm2 = drm + drm.t
val drm3 = drm - drm.t
val drm4 = drm * drm.t
val drm5 = drm / drm.t
{code}

Notice that in the last map, svec accepts an additional cardinality parameter,
so the cardinality of the created sparse vectors is consistent.



> Enhance svec function to accept cardinality as parameter 
> ---------------------------------------------------------
>
>                 Key: MAHOUT-1833
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1833
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>         Environment: Mahout Spark Shell 0.12.0,
> Spark 1.6.0 Cluster on Hadoop Yarn 2.7.1, 
> Centos7 64bit
>            Reporter: Edmond Luo
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
