GitHub user sethah opened a pull request:
https://github.com/apache/spark/pull/15628
[SPARK-17471][ML] Add compressed method to ML matrices
## What changes were proposed in this pull request?
This patch adds a `compressed` method to the ML `Matrix` trait, which returns
the minimal storage representation of the matrix, either sparse or dense.
Because the space occupied by a sparse matrix is dependent upon its layout
(i.e. column major or row major), this method must consider both cases. It may
also be useful to force the layout to be column or row major beforehand, so an
overload is added which takes in a `columnMajor: Boolean` parameter.
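To illustrate why the layout matters, here is a rough sketch of the storage arithmetic. It is not the code in this patch: the `StorageEstimate` object and its method names are hypothetical, and the estimate assumes 8-byte values and 4-byte indices while ignoring JVM object and array headers.

```scala
// Rough storage estimate for dense vs. compressed sparse layouts.
// Assumption: JVM object and array header overhead is ignored.
object StorageEstimate {
  // Dense storage: one 8-byte Double per entry, regardless of layout.
  def denseBytes(numRows: Int, numCols: Int): Long =
    8L * numRows * numCols

  // Compressed sparse storage: an 8-byte value and a 4-byte minor index per
  // stored entry, plus one 4-byte pointer per major-axis slice and one extra.
  def sparseBytes(numRows: Int, numCols: Int, nnz: Int, columnMajor: Boolean): Long = {
    val numPointers = (if (columnMajor) numCols else numRows) + 1
    12L * nnz + 4L * numPointers
  }

  def main(args: Array[String]): Unit = {
    val (rows, cols, nnz) = (1000, 10, 50)
    println(s"dense:              ${denseBytes(rows, cols)} bytes")
    println(s"sparse, col major:  ${sparseBytes(rows, cols, nnz, columnMajor = true)} bytes")
    println(s"sparse, row major:  ${sparseBytes(rows, cols, nnz, columnMajor = false)} bytes")
  }
}
```

For a tall, narrow matrix like this one, the column major sparse form needs far fewer pointers than the row major form, which is why both layouts have to be considered.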
The compressed implementation relies upon two new abstract methods
`toDense(columnMajor: Boolean)` and `toSparse(columnMajor: Boolean)`, similar
to the compressed method implemented in the `Vector` class. These methods also
allow the layout of the resulting matrix to be specified via the `columnMajor`
parameter. More detail on the new methods is given below.
## How was this patch tested?
Added many new unit tests
## New methods (summary, not exhaustive list)
**Matrix trait**
* `def toDense(columnMajor: Boolean): DenseMatrix` (abstract) - converts
the matrix (either sparse or dense) to dense format
* `def toSparse(columnMajor: Boolean): SparseMatrix` (abstract) - converts
the matrix (either sparse or dense) to sparse format
* `def compressed: Matrix` - finds the minimum space representation of this
matrix, considering both column and row major layouts, and converts it
* `def compressed(columnMajor: Boolean): Matrix` - finds the minimum space
representation of this matrix considering only column OR row major, and
converts it
**DenseMatrix class**
* `def toDense(columnMajor: Boolean): DenseMatrix` - converts the dense
matrix to a dense matrix, optionally changing the layout (data is NOT
duplicated if the layouts are the same)
* `def toSparse(columnMajor: Boolean): SparseMatrix` - converts the dense
matrix to a sparse matrix, using the specified layout
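
As a minimal sketch of what a dense-to-sparse conversion with a selectable layout involves (not the implementation in this patch), the following works on plain arrays. The `DenseToSparseSketch` object, the `Compressed` case class, and `toCompressed` are hypothetical names, and the input is assumed to be stored in column major order.

```scala
object DenseToSparseSketch {
  // Hypothetical container for compressed sparse arrays (not a Spark class).
  final case class Compressed(ptrs: Array[Int], indices: Array[Int], values: Array[Double])

  // `values` holds a numRows x numCols matrix in column major order.
  // With columnMajor = true the result is CSC; otherwise it is CSR.
  def toCompressed(
      numRows: Int,
      numCols: Int,
      values: Array[Double],
      columnMajor: Boolean): Compressed = {
    require(values.length == numRows * numCols, "values must have numRows * numCols entries")
    val (numMajor, numMinor) = if (columnMajor) (numCols, numRows) else (numRows, numCols)
    val ptrs = new Array[Int](numMajor + 1)
    val indices = Array.newBuilder[Int]
    val nonZeros = Array.newBuilder[Double]
    var count = 0
    for (major <- 0 until numMajor) {
      for (minor <- 0 until numMinor) {
        // Map (major, minor) back to (row, column) for the requested layout.
        val (i, j) = if (columnMajor) (minor, major) else (major, minor)
        val v = values(i + j * numRows)
        if (v != 0.0) {
          indices += minor
          nonZeros += v
          count += 1
        }
      }
      // Each pointer records how many entries are stored up to this slice.
      ptrs(major + 1) = count
    }
    Compressed(ptrs, indices.result(), nonZeros.result())
  }
}
```
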
**SparseMatrix class**
* `def toDense(columnMajor: Boolean): DenseMatrix` - converts the sparse
matrix to a dense matrix, using the specified layout
* `def toSparse(columnMajor: Boolean): SparseMatrix` - converts the sparse
matrix to a sparse matrix. If the sparse matrix contains any explicit zeros, they
are removed. If the requested layout does not match the current layout, data is
copied to a new representation. If the layouts match and no explicit zeros
exist, the current matrix is returned.
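
The explicit-zero removal described for the sparse-to-sparse case can be sketched on plain compressed arrays as follows. This is an illustration under assumptions, not the code in this patch: `DropExplicitZeros` and `compact` are hypothetical names, and the layout is left unchanged.

```scala
object DropExplicitZeros {
  // Returns (ptrs, indices, values) with explicitly stored zeros removed.
  // Works for either CSC or CSR, since only the stored entries are scanned.
  def compact(
      ptrs: Array[Int],
      indices: Array[Int],
      values: Array[Double]): (Array[Int], Array[Int], Array[Double]) = {
    val newPtrs = new Array[Int](ptrs.length)
    val newIndices = Array.newBuilder[Int]
    val newValues = Array.newBuilder[Double]
    var kept = 0
    for (major <- 0 until ptrs.length - 1) {
      for (k <- ptrs(major) until ptrs(major + 1)) {
        if (values(k) != 0.0) {
          newIndices += indices(k)
          newValues += values(k)
          kept += 1
        }
      }
      // Pointers are rebuilt because dropped entries shift later offsets.
      newPtrs(major + 1) = kept
    }
    (newPtrs, newIndices.result(), newValues.result())
  }
}
```
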
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sethah/spark matrix_compress
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15628.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15628
----
commit 5a29a4513b9a917c05c117cd03efe79a2dd2875a
Author: sethah <[email protected]>
Date: 2016-09-08T21:52:42Z
first commit
commit d2abb730f6a152f43d9afbee416e36fc2d4e16b2
Author: sethah <[email protected]>
Date: 2016-09-22T14:48:02Z
start to add tests
commit ee8ca60096f54beabf8cce9f348bda6f78fdfbd2
Author: sethah <[email protected]>
Date: 2016-09-23T22:55:20Z
sparse to sparse stuff
commit 68fc20e3cf9087e855edcbd12177183a77c3c36b
Author: sethah <[email protected]>
Date: 2016-10-25T17:25:47Z
improve test cases and cleanup
commit 011b6019d78eb73e39a0de51d6a4d905a43fb2ad
Author: sethah <[email protected]>
Date: 2016-10-25T18:22:23Z
adding some helper methods and shoring up test cases
commit d00926efbe637133b0f2d27dbfba14ddd97f9e57
Author: sethah <[email protected]>
Date: 2016-10-25T19:34:01Z
cleanup
commit a51e2173089cf79781b0d9a37492a4c4b4080881
Author: sethah <[email protected]>
Date: 2016-10-25T19:51:07Z
minor cleanup
----