Hi Edmond. Could you open a pull request so that we can review this? Thank you,
Andy

-------- Original message --------
From: "Edmond Luo (JIRA)" <j...@apache.org>
Date: 04/19/2016 4:29 AM (GMT-05:00)
To: dev@mahout.apache.org
Subject: [jira] [Updated] (MAHOUT-1833) One more svec function accepting cardinality as parameter

[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edmond Luo updated MAHOUT-1833:
-------------------------------
Description:
It would be nice to add one more wrapper function like the one below to org.apache.mahout.math.scalabindings:
{code}
/**
 * Create a sparse vector out of a list of Tuple2's with a specific cardinality (size).
 * Throws IllegalArgumentException if cardinality is smaller than the cardinality required by sdata.
 * @param cardinality logical size of the resulting vector
 * @param sdata (index, value) pairs to store in the vector
 * @return the populated sparse vector
 */
def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
  // The highest index in sdata determines the minimum usable cardinality
  val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
  if (cardinality < required) {
    throw new IllegalArgumentException(s"Cardinality[$cardinality] must be at least required[$required]!")
  }
  val initialCapacity = sdata.size
  val sv = new RandomAccessSparseVector(cardinality, initialCapacity)
  sdata.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
  sv
}
{code}
This lets the user specify the cardinality of the created sparse vector.
This is useful when the user wants to create a DRM from many sparse vectors that differ in actual size but share the same logical size, e.g. rows of a sparse matrix.
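As a rough illustration of the cardinality check described above, the following self-contained sketch reproduces the `required` computation with plain Scala collections, so it runs without Mahout. The object `CardinalityDemo` and its helpers `requiredCardinality` and `toDense` are hypothetical names introduced here for illustration only; they are not part of org.apache.mahout.math.scalabindings, and a dense array stands in for the sparse vector.

```scala
// Illustrative sketch only: plain Scala, no Mahout dependency.
// CardinalityDemo, requiredCardinality and toDense are hypothetical names.
object CardinalityDemo {

  // Smallest cardinality that can hold every index in sdata
  // (highest index + 1), or 0 if sdata is empty.
  def requiredCardinality(sdata: Seq[(Int, AnyVal)]): Int =
    if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0

  // Expand (index, value) pairs into a dense array of the requested cardinality.
  // Throws IllegalArgumentException (via require) if the cardinality is too small.
  def toDense(cardinality: Int, sdata: Seq[(Int, AnyVal)]): Array[Double] = {
    val required = requiredCardinality(sdata)
    require(cardinality >= required,
      s"Cardinality[$cardinality] must be at least required[$required]!")
    val out = new Array[Double](cardinality)
    sdata.foreach { case (i, v) => out(i) = v.asInstanceOf[Number].doubleValue() }
    out
  }

  def main(args: Array[String]): Unit = {
    val rowA = Seq((0, 1), (3, 1)) // highest index is 3
    val rowB = Seq((7, 1))         // highest index is 7
    println(requiredCardinality(rowA))
    println(requiredCardinality(rowB))
    // With a shared cardinality, both rows get the same length
    println(toDense(20, rowA).length == toDense(20, rowB).length)
  }
}
```

With a shared cardinality of 20, both rows expand to arrays of the same length even though the cardinalities they require differ, which is the consistency the proposed svec overload is meant to provide.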
The code below demonstrates the case:
{code}
val cardinality = 20
val rdd = sc.textFile("/some/file.txt").map(_.split(",")).map(line => (line(0).toInt, Array((line(1).toInt, 1)))).reduceByKey((v1, v2) => v1 ++ v2).map(row => (row._1, svec(cardinality, row._2)))
val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

// Element-wise operations like these fail on a DRM whose SparseVectors do not share a consistent cardinality
val drm2 = drm + drm
val drm3 = drm - drm
val drm4 = drm * drm
val drm5 = drm / drm
{code}
Notice that in the last map, svec accepts one more parameter, cardinality, so the cardinality of the created SparseVectors can be consistent.

was:
It would be nice to add one more wrapper function like the one below to org.apache.mahout.math.scalabindings:
{code}
/**
 * Create a sparse vector out of a list of Tuple2's with a specific cardinality (size).
 * Throws IllegalArgumentException if cardinality is smaller than the cardinality required by sdata.
 * @param cardinality logical size of the resulting vector
 * @param sdata (index, value) pairs to store in the vector
 * @return the populated sparse vector
 */
def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
  // The highest index in sdata determines the minimum usable cardinality
  val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
  if (cardinality < required) {
    throw new IllegalArgumentException(s"Cardinality[$cardinality] must be at least required[$required]!")
  }
  val initialCapacity = sdata.size
  val sv = new RandomAccessSparseVector(cardinality, initialCapacity)
  sdata.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
  sv
}
{code}
This lets the user specify the cardinality of the created sparse vector.
This is useful when the user wants to create a DRM from many sparse vectors that differ in actual size but share the same logical size, e.g. rows of a sparse matrix.
The code below demonstrates the case:
{code}
val cardinality = 20
val rdd = sc.textFile("/some/file.txt").map(_.split(",")).map(line => (line(0).toInt, Array((line(1).toInt, 1)))).reduceByKey((v1, v2) => v1 ++ v2).map(row => (row._1, svec(cardinality, row._2)))
val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

// Element-wise operations like these fail on a DRM whose SparseVectors do not share a consistent cardinality
val drm2 = drm + drm
val drm3 = drm - drm
val drm4 = drm * drm
val drm5 = drm / drm
{code}
Notice that in the last map, the svec above accepts one more parameter, so the cardinality of the created SparseVectors can be consistent.


> One more svec function accepting cardinality as parameter
> ----------------------------------------------------------
>
>                 Key: MAHOUT-1833
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1833
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>         Environment: Mahout Spark Shell 0.12.0,
>                      Spark 1.6.0 Cluster on Hadoop Yarn 2.7.1,
>                      Centos7 64bit
>            Reporter: Edmond Luo
>
> It would be nice to add one more wrapper function like the one below to org.apache.mahout.math.scalabindings:
> {code}
> /**
>  * Create a sparse vector out of a list of Tuple2's with a specific cardinality (size).
>  * Throws IllegalArgumentException if cardinality is smaller than the cardinality required by sdata.
>  * @param cardinality logical size of the resulting vector
>  * @param sdata (index, value) pairs to store in the vector
>  * @return the populated sparse vector
>  */
> def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
>   // The highest index in sdata determines the minimum usable cardinality
>   val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
>   if (cardinality < required) {
>     throw new IllegalArgumentException(s"Cardinality[$cardinality] must be at least required[$required]!")
>   }
>   val initialCapacity = sdata.size
>   val sv = new RandomAccessSparseVector(cardinality, initialCapacity)
>   sdata.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
>   sv
> }
> {code}
> This lets the user specify the cardinality of the created sparse vector.
> This is useful when the user wants to create a DRM from many sparse vectors that differ in actual size but share the same logical size, e.g. rows of a sparse matrix.
> The code below demonstrates the case:
> {code}
> val cardinality = 20
> val rdd = sc.textFile("/some/file.txt").map(_.split(",")).map(line => (line(0).toInt, Array((line(1).toInt, 1)))).reduceByKey((v1, v2) => v1 ++ v2).map(row => (row._1, svec(cardinality, row._2)))
> val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))
>
> // Element-wise operations like these fail on a DRM whose SparseVectors do not share a consistent cardinality
> val drm2 = drm + drm
> val drm3 = drm - drm
> val drm4 = drm * drm
> val drm5 = drm / drm
> {code}
> Notice that in the last map, svec accepts one more parameter, cardinality, so the cardinality of the created SparseVectors can be consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)