Re: [jira] [Created] (MAHOUT-1833) One more svec function accepting cardinality as parameter

2016-04-19 Thread Suneel Marthi
Would you like to make a PR that can be reviewed?



[jira] [Created] (MAHOUT-1833) One more svec function accepting cardinality as parameter

2016-04-19 Thread Edmond Luo (JIRA)
Edmond Luo created MAHOUT-1833:
--

         Summary: One more svec function accepting cardinality as parameter
             Key: MAHOUT-1833
             URL: https://issues.apache.org/jira/browse/MAHOUT-1833
         Project: Mahout
      Issue Type: Improvement
Affects Versions: 0.12.0
     Environment: Mahout Spark Shell 0.12.0, Spark 1.6.0 cluster on Hadoop YARN 2.7.1, CentOS 7 64-bit
        Reporter: Edmond Luo


It would be nice to add one more wrapper function, like the one below, to 
org.apache.mahout.math.scalabindings:

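{code}
import org.apache.mahout.math.RandomAccessSparseVector

/**
 * Create a sparse vector out of a list of Tuple2's with a specific cardinality (size).
 *
 * @param cardinality logical size of the resulting vector
 * @param sdata (index, value) pairs to store in the vector
 * @return a sparse vector with the given cardinality
 * @throws IllegalArgumentException if cardinality is smaller than the cardinality required by sdata
 */
def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
  // Materialize once: a TraversableOnce (e.g. an Iterator) may only be traversed a single time.
  val data = sdata.toSeq
  // Smallest cardinality that can hold every index in the data.
  val required = if (data.nonEmpty) data.map(_._1).max + 1 else 0
  if (cardinality < required) {
    throw new IllegalArgumentException(
      s"Cardinality[$cardinality] must not be smaller than required[$required]!")
  }

  val initialCapacity = data.size
  val sv = new RandomAccessSparseVector(cardinality, initialCapacity)
  data.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
  sv
}
{code}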

This way the user can specify the cardinality of the created sparse vector.
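
To make the intended contract concrete, here is a minimal, dependency-free sketch of the cardinality check. `SimpleSparse` and `svecSketch` are hypothetical stand-ins introduced only for illustration; they are not Mahout classes, and the proposed function would build a `RandomAccessSparseVector` instead.

```scala
// Hypothetical stand-in for a sparse vector: a logical size plus the non-zero entries.
final case class SimpleSparse(size: Int, entries: Map[Int, Double])

object SvecSketch {
  // Mirrors only the validation of the proposed svec(cardinality, sdata); not Mahout API.
  def svecSketch(cardinality: Int, sdata: Seq[(Int, Double)]): SimpleSparse = {
    // Smallest size that can hold every index in sdata (0 for empty input).
    val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
    if (cardinality < required)
      throw new IllegalArgumentException(
        s"Cardinality[$cardinality] must not be smaller than required[$required]!")
    SimpleSparse(cardinality, sdata.toMap)
  }
}
```

Under these assumptions, `SvecSketch.svecSketch(20, Seq((3, 1.0)))` yields a vector of size 20 with a single entry, while `SvecSketch.svecSketch(3, Seq((3, 1.0)))` throws, because index 3 needs a cardinality of at least 4.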

This is very useful and convenient when a user wants to create a DRM from many 
sparse vectors that differ in actual size but share the same logical size 
(e.g. rows of a sparse matrix).
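
The gap between actual and logical size can be sketched without Mahout. The helper `inferredSize` and the sample rows are hypothetical; the helper only mirrors the "highest index + 1" rule that the proposal uses to compute `required`.

```scala
object InferredSizes {
  // Size implied by the data alone: highest index + 1 (0 for empty input).
  def inferredSize(sdata: Seq[(Int, Double)]): Int =
    if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0

  // Three made-up rows of one sparse matrix that logically has 20 columns.
  val rows: Seq[Seq[(Int, Double)]] = Seq(
    Seq((0, 1.0), (3, 1.0)),
    Seq((7, 1.0)),
    Seq((2, 1.0), (19, 1.0))
  )

  // Sizes differ per row when each one is inferred from its own data.
  val sizes: Seq[Int] = rows.map(inferredSize)

  def main(args: Array[String]): Unit =
    println(sizes.mkString(", ")) // prints "4, 8, 20"
}
```

Only the last row happens to reach the logical width of 20, which is why an explicit cardinality argument is needed to make all rows agree.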

The code below demonstrates the case:
{code}
val cardinality = 20

// Parse "rowKey,columnIndex" lines, collect the column indices per row key,
// then build one sparse vector per row with the same cardinality.
val rdd = sc.textFile("/some/file.txt")
  .map(_.split(","))
  .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
  .reduceByKey((v1, v2) => v1 ++ v2)
  .map(row => (row._1, svec(cardinality, row._2)))

val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

// All element-wise operations fail on a DRM whose SparseVectors
// do not have a consistent cardinality
val drm2 = drm + drm
val drm3 = drm - drm
val drm4 = drm * drm
val drm5 = drm / drm
{code}

Notice that in the last map, svec accepts one more parameter, so the 
cardinality of the created SparseVectors is consistent.



