[
https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249549#comment-15249549
]
ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------
GitHub user resec opened a pull request:
https://github.com/apache/mahout/pull/224
MAHOUT-1833 - Enhance svec function to accept cardinality as parameter
### What is this PR for?
Enhance the existing svec function to accept cardinality as a parameter (with a
default value defined), so users can specify the size of the created vector.
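
As a rough illustration of the idea, the sketch below shows how a defaulted cardinality parameter might behave. It is not the code from this pull request: `SparseVec` is a hypothetical stand-in for Mahout's RandomAccessSparseVector so the example runs without Mahout, and the parameter order and the treatment of -1 as "infer the size" are assumptions based only on the commit message.

```scala
// Hypothetical stand-in for a sparse vector: logical size plus non-zero entries.
final case class SparseVec(size: Int, entries: Map[Int, Double])

// cardinality = -1 is taken to mean "infer the size from the largest index".
// That interpretation is an assumption, not confirmed by the pull request text.
def svec(sdata: Seq[(Int, Double)], cardinality: Int = -1): SparseVec = {
  // Smallest size that can hold every index in sdata.
  val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
  val size = if (cardinality < 0) required else cardinality
  // require throws IllegalArgumentException when the condition is false.
  require(size >= required, s"Cardinality[$size] must be at least required[$required]")
  SparseVec(size, sdata.toMap)
}

val inferred = svec(Seq(1 -> 2.0, 4 -> 1.0))      // size inferred from the data
val explicit = svec(Seq(1 -> 2.0, 4 -> 1.0), 20)  // size chosen by the caller
```

Keeping the new parameter optional is what lets existing callers compile unchanged, which matches the "no breaking changes" answer further down.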
### What type of PR is it?
[Improvement]
### Todos
* [x] - Add the cardinality parameter to svec with default value defined
* [x] - Add test case to MathSuite
* [ ] - Update docs if needed (pending check)
### What is the Jira issue?
* https://issues.apache.org/jira/browse/MAHOUT-1833
### How should this be tested?
1. Clone the repository locally
2. Build and run the tests with Maven; all tests should pass
### Questions:
* Do the license files need an update? No
* Are there breaking changes for older versions? No
* Does this need documentation? Pending check
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/resec/mahout new_svec
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/mahout/pull/224.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #224
----
commit c28ad3eed7329b0da7837f1102c8c1f8fba021f8
Author: yougoer <[email protected]>
Date: 2016-04-20T08:56:11Z
[MAHOUT-1833] add one more param cardinality with default value -1 and
corresponding test cases
----
> One more svec function accepting cardinality as parameter
> ----------------------------------------------------------
>
> Key: MAHOUT-1833
> URL: https://issues.apache.org/jira/browse/MAHOUT-1833
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.12.0
> Environment: Mahout Spark Shell 0.12.0,
> Spark 1.6.0 Cluster on Hadoop Yarn 2.7.1,
> Centos7 64bit
> Reporter: Edmond Luo
>
> It would be nice to add one more wrapper function, like the one below, to
> org.apache.mahout.math.scalabindings:
> {code}
> /**
>  * Creates a sparse vector from a list of Tuple2s with a specific
>  * cardinality (size). Throws IllegalArgumentException if cardinality is
>  * smaller than the cardinality required by sdata (largest index + 1).
>  * @param cardinality size of the created vector
>  * @param sdata (index, value) pairs
>  * @return the created sparse vector
>  */
> def svec(cardinality: Int, sdata: TraversableOnce[(Int, AnyVal)]) = {
>   // Materialize once: a TraversableOnce may only be safe to traverse one time.
>   val data = sdata.toIndexedSeq
>   val required = if (data.nonEmpty) data.map(_._1).max + 1 else 0
>   if (cardinality < required) {
>     throw new IllegalArgumentException(
>       s"Cardinality[$cardinality] must not be smaller than required[$required]!")
>   }
>   val initialCapacity = data.size
>   val sv = new RandomAccessSparseVector(cardinality, initialCapacity)
>   data.foreach(t ⇒ sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
>   sv
> }
> {code}
> This lets the user specify the cardinality of the created sparse vector.
> It is useful and convenient when creating a DRM from many sparse vectors
> whose actual sizes differ but whose logical size is the same (e.g. rows of a
> sparse matrix).
> The code below demonstrates the case:
> {code}
> val cardinality = 20
> val rdd = sc.textFile("/some/file.txt")
>   .map(_.split(","))
>   .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
>   .reduceByKey((v1, v2) => v1 ++ v2)
>   .map(row => (row._1, svec(cardinality, row._2)))
> val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))
> // Element-wise operations like these fail if the DRM's sparse vectors
> // do not share a consistent cardinality
> val drm2 = drm + drm
> val drm3 = drm - drm
> val drm4 = drm * drm
> val drm5 = drm / drm
> {code}
> Notice that in the last map, svec accepts an additional cardinality
> parameter, so the created SparseVectors all have a consistent cardinality.
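
To make the motivation concrete, here is a minimal sketch of why a shared cardinality matters for element-wise operations. It does not use Mahout: `SparseRow` and its `+` are hypothetical stand-ins, and the size check only mimics the kind of failure the issue describes rather than reproducing Mahout's actual error.

```scala
// Hypothetical stand-in for a sparse row: logical size plus non-zero entries.
final case class SparseRow(size: Int, entries: Map[Int, Double]) {
  // Element-wise addition is only defined for rows of equal logical size.
  def +(that: SparseRow): SparseRow = {
    require(size == that.size, s"Cardinality mismatch: $size vs ${that.size}")
    val keys = entries.keySet ++ that.entries.keySet
    val summed = keys.map { k =>
      k -> (entries.getOrElse(k, 0.0) + that.entries.getOrElse(k, 0.0))
    }.toMap
    SparseRow(size, summed)
  }
}

// Two rows of the same logical matrix whose largest indices differ.
val rowA = Seq(0 -> 1.0, 2 -> 1.0)
val rowB = Seq(1 -> 1.0, 7 -> 1.0)

// Size inferred from the largest index: the two rows end up with different sizes.
def inferredRow(d: Seq[(Int, Double)]): SparseRow =
  SparseRow(d.map(_._1).max + 1, d.toMap)

// Adding rows with inferred sizes is rejected.
val mismatch = scala.util.Try(inferredRow(rowA) + inferredRow(rowB))

// Giving both rows the same explicit cardinality makes them compatible.
val sum = SparseRow(20, rowA.toMap) + SparseRow(20, rowB.toMap)
```

With an explicit cardinality, every row carries the same logical size regardless of which indices happen to be populated.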