GitHub user ghoto opened a pull request:
https://github.com/apache/spark/pull/17940
Bug fix/spark 20687
## What changes were proposed in this pull request?
Bugfix for https://issues.apache.org/jira/browse/SPARK-20687
Before converting a Breeze CSCMatrix to a Matrix, the trailing buffer 0s that
Breeze adds to rowIndices and data are removed to avoid inconsistencies with
colPtrs. These trailing buffers are often generated by operations between
matrices, such as addition or subtraction. The existing conversion code
therefore throws exceptions on valid BlockMatrix.add and BlockMatrix.subtract
operations, because blocks are stored as SparseMatrix, converted to Breeze,
and converted back to SparseMatrix.
http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add
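To illustrate the truncation described above, here is a minimal, library-free sketch. `CscArrays` and `CscCompaction.compact` are hypothetical names used only for this example; they are not the Breeze or Spark classes, and this is not necessarily how the patch itself is written. The sketch assumes the usual CSC convention that the last entry of colPtrs is the number of stored entries.

```scala
// Hypothetical stand-in for the raw arrays of a CSC matrix (not a Breeze/Spark type).
final case class CscArrays(
    numRows: Int,
    numCols: Int,
    colPtrs: Array[Int],    // length numCols + 1; last entry = number of stored entries
    rowIndices: Array[Int], // may carry trailing padding after an add/subtract
    data: Array[Double])    // may carry trailing padding after an add/subtract

object CscCompaction {
  // Drop buffer slots beyond the number of entries that colPtrs accounts for,
  // so that rowIndices and data are consistent with colPtrs again.
  def compact(m: CscArrays): CscArrays = {
    val used = m.colPtrs.last
    require(m.rowIndices.length >= used && m.data.length >= used,
      "rowIndices and data must hold at least colPtrs.last entries")
    if (m.rowIndices.length == used && m.data.length == used) m
    else m.copy(rowIndices = m.rowIndices.take(used), data = m.data.take(used))
  }
}
```

For example, a 2x2 matrix with colPtrs `[0, 1, 2]`, rowIndices `[0, 1, 0, 0]` and data `[1.0, 2.0, 0.0, 0.0]` holds two real entries followed by two padding slots; `compact` keeps only the first two elements of each array.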
## How was this patch tested?
Added a test to MatricesSuite that verifies that the conversions are valid
and that the code doesn't crash. Before this patch, the same code crashed on
Spark.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ghoto/spark bug-fix/SPARK-20687
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17940.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17940
----
commit 62d78a241c95d09896b731776e29a8cb883dfc49
Author: Ignacio Bermudez <[email protected]>
Date: 2017-05-10T04:31:03Z
Reproducing SPARK-20687
commit dbbd39121f3210f6edd7a74bb21853fbda20c0cb
Author: Ignacio Bermudez <[email protected]>
Date: 2017-05-10T18:03:14Z
[SPARK-20687] mllib.Matrices.fromBreeze may cause crash when converting
breeze CSCMatrix
In an operation on two CSCMatrices A and B, the resulting matrix C may have
some extra 0s in rowIndices and data, which Breeze creates as a performance
optimization.
This causes problems when converting back to mllib.Matrix, because the
conversion relies on rowIndices and data being coherent with colPtrs.
Therefore it is necessary to truncate rowIndices and data to the number of
active elements held by C before creating a Spark SparseMatrix.
----