GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/15628

    [SPARK-17471][ML] Add compressed method to ML matrices

    ## What changes were proposed in this pull request?
    
    This patch adds a `compressed` method to the ML `Matrix` trait, which returns 
the minimal storage representation of the matrix, either sparse or dense. 
Because the space occupied by a sparse matrix depends on its layout 
(i.e. column major or row major), this method must consider both cases. It may 
also be useful to force the layout to be column or row major beforehand, so an 
overload is added that takes a `columnMajor: Boolean` parameter.
    
    The compressed implementation relies upon two new abstract methods 
`toDense(columnMajor: Boolean)` and `toSparse(columnMajor: Boolean)`, similar 
to the compressed method implemented in the `Vector` class. These methods also 
allow the layout of the resulting matrix to be specified via the `columnMajor` 
parameter. More detail on the new methods is given below.
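
To illustrate why the layout matters for the size comparison, here is a minimal sketch that is independent of Spark. It assumes 8-byte values, 4-byte indices, one pointer per column (column major) or per row (row major) plus one, and it ignores JVM object overhead. The object and method names (`CompressedSketch`, `denseSize`, `sparseSize`, `smallest`) are hypothetical and are not part of this patch.

```scala
object CompressedSketch {
  sealed trait Layout
  case object Dense extends Layout
  case object SparseColMajor extends Layout
  case object SparseRowMajor extends Layout

  // Assumed payload size of a dense matrix: one 8-byte double per entry.
  def denseSize(numRows: Int, numCols: Int): Long = 8L * numRows * numCols

  // Assumed payload size of a compressed sparse matrix: an 8-byte value and a
  // 4-byte index per stored entry, plus one 4-byte pointer per pointer slot.
  def sparseSize(numNonzeros: Long, numPtrs: Int): Long =
    12L * numNonzeros + 4L * numPtrs

  // Picks the layout with the smallest estimated size (first one wins on ties).
  def smallest(numRows: Int, numCols: Int, numNonzeros: Long): Layout = {
    val candidates: Seq[(Layout, Long)] = Seq(
      Dense -> denseSize(numRows, numCols),
      SparseColMajor -> sparseSize(numNonzeros, numCols + 1),
      SparseRowMajor -> sparseSize(numNonzeros, numRows + 1))
    candidates.minBy(_._2)._1
  }
}
```

Under these assumptions a tall, narrow matrix with few nonzeros favors the column major sparse layout (fewer pointers), a short, wide one favors row major, and a fully populated matrix stays dense.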
    
    ## How was this patch tested?
    Added many new unit tests
    
    ## New methods (summary, not exhaustive list)
    
    **Matrix trait**
    
    * `def toDense(columnMajor: Boolean): DenseMatrix` (abstract) - converts 
the matrix (either sparse or dense) to dense format
    * `def toSparse(columnMajor: Boolean): SparseMatrix` (abstract) - converts 
the matrix (either sparse or dense) to sparse format
    * `def compressed: Matrix` - finds the minimum space representation of this 
matrix, considering both column and row major layouts, and converts the matrix 
to that representation
    * `def compressed(columnMajor: Boolean): Matrix` - finds the minimum space 
representation of this matrix considering only column OR row major, and 
converts the matrix to that representation
    
    **DenseMatrix class**
    
    * `def toDense(columnMajor: Boolean): DenseMatrix` - converts the dense 
matrix to a dense matrix, optionally changing the layout (data is NOT 
duplicated if the layouts are the same)
    * `def toSparse(columnMajor: Boolean): SparseMatrix` - converts the dense 
matrix to a sparse matrix, using the specified layout
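
As a rough illustration of the dense conversions described above (not the code in this patch), the sketch below uses a hypothetical, Spark-independent `Dense` case class. Its `toDense` returns the same instance when the layout already matches, and its `toCsc` helper (also hypothetical) produces column major sparse arrays.

```scala
object DenseConversionSketch {
  // Hypothetical minimal dense matrix; `values` is stored column major or row major.
  final case class Dense(numRows: Int, numCols: Int, values: Array[Double], columnMajor: Boolean) {
    // Reads entry (i, j) regardless of the storage layout.
    def apply(i: Int, j: Int): Double =
      if (columnMajor) values(i + j * numRows) else values(i * numCols + j)

    // Returns `this` when the layout already matches, so no data is copied.
    def toDense(targetColumnMajor: Boolean): Dense =
      if (targetColumnMajor == columnMajor) this
      else {
        val out = new Array[Double](values.length)
        for (i <- 0 until numRows; j <- 0 until numCols) {
          val idx = if (targetColumnMajor) i + j * numRows else i * numCols + j
          out(idx) = apply(i, j)
        }
        Dense(numRows, numCols, out, targetColumnMajor)
      }

    // Column major sparse arrays (colPtrs, rowIndices, values); zeros are skipped.
    def toCsc: (Array[Int], Array[Int], Array[Double]) = {
      val colPtrs = new Array[Int](numCols + 1)
      val rowIndices = Array.newBuilder[Int]
      val vals = Array.newBuilder[Double]
      var nnz = 0
      for (j <- 0 until numCols) {
        for (i <- 0 until numRows) {
          val v = apply(i, j)
          if (v != 0.0) { rowIndices += i; vals += v; nnz += 1 }
        }
        colPtrs(j + 1) = nnz // running count of stored entries through column j
      }
      (colPtrs, rowIndices.result(), vals.result())
    }
  }
}
```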
    
    **SparseMatrix class**
    
    * `def toDense(columnMajor: Boolean): DenseMatrix` - converts the sparse 
matrix to a dense matrix, using the specified layout
    * `def toSparse(columnMajor: Boolean): SparseMatrix` - converts the sparse 
matrix to a sparse matrix. If the sparse matrix contains any explicit zeros, they 
are removed. If the requested layout does not match the current layout, data is 
copied to a new representation. If the layouts match and no explicit zeros 
exist, the current matrix is returned.
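
The explicit-zero handling described above can be sketched as follows. This is a simplified illustration that covers only the case where the layout is unchanged; the `Csc` class and its `withoutExplicitZeros` method are hypothetical and are not the implementation in this patch.

```scala
object SparseCleanupSketch {
  // Hypothetical minimal column major sparse (CSC) matrix.
  final class Csc(
      val numRows: Int,
      val numCols: Int,
      val colPtrs: Array[Int],
      val rowIndices: Array[Int],
      val values: Array[Double]) {

    // Drops stored zeros; returns `this` unchanged when there are none.
    def withoutExplicitZeros: Csc = {
      if (!values.exists(_ == 0.0)) this
      else {
        val newPtrs = new Array[Int](numCols + 1)
        val newRows = Array.newBuilder[Int]
        val newVals = Array.newBuilder[Double]
        var kept = 0
        for (j <- 0 until numCols) {
          // Walk the stored entries of column j and keep only the nonzero ones.
          for (k <- colPtrs(j) until colPtrs(j + 1) if values(k) != 0.0) {
            newRows += rowIndices(k)
            newVals += values(k)
            kept += 1
          }
          newPtrs(j + 1) = kept
        }
        new Csc(numRows, numCols, newPtrs, newRows.result(), newVals.result())
      }
    }
  }
}
```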


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark matrix_compress

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15628.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15628
    
----
commit 5a29a4513b9a917c05c117cd03efe79a2dd2875a
Author: sethah <[email protected]>
Date:   2016-09-08T21:52:42Z

    first commit

commit d2abb730f6a152f43d9afbee416e36fc2d4e16b2
Author: sethah <[email protected]>
Date:   2016-09-22T14:48:02Z

    start to add tests

commit ee8ca60096f54beabf8cce9f348bda6f78fdfbd2
Author: sethah <[email protected]>
Date:   2016-09-23T22:55:20Z

    sparse to sparse stuff

commit 68fc20e3cf9087e855edcbd12177183a77c3c36b
Author: sethah <[email protected]>
Date:   2016-10-25T17:25:47Z

    improve test cases and cleanup

commit 011b6019d78eb73e39a0de51d6a4d905a43fb2ad
Author: sethah <[email protected]>
Date:   2016-10-25T18:22:23Z

    adding some helper methods and shoring up test cases

commit d00926efbe637133b0f2d27dbfba14ddd97f9e57
Author: sethah <[email protected]>
Date:   2016-10-25T19:34:01Z

    cleanup

commit a51e2173089cf79781b0d9a37492a4c4b4080881
Author: sethah <[email protected]>
Date:   2016-10-25T19:51:07Z

    minor cleanup

----

