Joseph K. Bradley created SPARK-20214:
-----------------------------------------
Summary: pyspark.mllib SciPyTests test_serialize
Key: SPARK-20214
URL: https://issues.apache.org/jira/browse/SPARK-20214
Project: Spark
Issue Type: Bug
Components: ML, MLlib, PySpark, Tests
Affects Versions: 2.0.2, 2.1.1, 2.2.0
Reporter: Joseph K. Bradley
I've seen a few failures of this line:
https://github.com/apache/spark/blame/402bf2a50ddd4039ff9f376b641bd18fffa54171/python/pyspark/mllib/tests.py#L847
It converts a scipy.sparse.lil_matrix to a dok_matrix and then to a
pyspark.mllib.linalg.Vector. The failure happens in the conversion to a vector
and indicates that the dok_matrix is not returning its values in sorted order.
(Actually, the failure is in _convert_to_vector, which converts the dok_matrix
to a csc_matrix and then passes the CSC data to the MLlib Vector constructor.)
Here's the stack trace:
{code}
Traceback (most recent call last):
  File "/home/jenkins/workspace/python/pyspark/mllib/tests.py", line 847, in test_serialize
    self.assertEqual(sv, _convert_to_vector(lil.todok()))
  File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
    return SparseVector(l.shape[0], csc.indices, csc.data)
  File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
    % (self.indices[i], self.indices[i + 1]))
TypeError: Indices 3 and 1 are not strictly increasing
{code}
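For illustration, the invariant behind this error can be approximated in plain Python. This is only a sketch: check_strictly_increasing and sort_pairs are hypothetical helper names, not PySpark functions, and the error message format is copied from the trace above.

```python
def check_strictly_increasing(indices):
    """Hypothetical helper: raise TypeError unless indices strictly increase."""
    for i in range(len(indices) - 1):
        if indices[i] >= indices[i + 1]:
            raise TypeError(
                "Indices %s and %s are not strictly increasing"
                % (indices[i], indices[i + 1]))


def sort_pairs(indices, values):
    """Hypothetical helper: reorder indices and values by ascending index."""
    order = sorted(range(len(indices)), key=lambda k: indices[k])
    return [indices[k] for k in order], [values[k] for k in order]


# Unsorted input, as in the failing test run (the values are made up).
indices, values = [3, 1], [4.0, 2.0]
try:
    check_strictly_increasing(indices)
except TypeError as exc:
    print(exc)

# Sorting the index/value pairs together satisfies the invariant.
indices, values = sort_pairs(indices, values)
check_strictly_increasing(indices)
```

Sorting indices and values together matters: reordering only the indices would attach values to the wrong positions.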
This seems like a bug in _convert_to_vector, where we really should check
{{csc_matrix.has_sorted_indices}} first.
I haven't seen this bug in pyspark.ml.linalg, but it probably exists there too.
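A minimal sketch of the check suggested above, assuming SciPy and NumPy are installed. sorted_csc_column is a hypothetical helper name and is not the actual _convert_to_vector code. The example matrix is built by hand with unsorted row indices to mimic the failure.

```python
import numpy as np
from scipy.sparse import csc_matrix


def sorted_csc_column(mat):
    """Hypothetical helper: return mat as CSC with sorted row indices.

    Note: if mat is already a csc_matrix, tocsc() may return the same
    object, so the in-place sort below can modify the input.
    """
    csc = mat.tocsc()
    if not csc.has_sorted_indices:
        csc.sort_indices()  # sorts indices and data together, in place
    return csc


# A 4x1 column with entries at rows 3 and 1, stored in unsorted order.
unsorted = csc_matrix(
    (np.array([4.0, 2.0]), np.array([3, 1]), np.array([0, 2])),
    shape=(4, 1))

fixed = sorted_csc_column(unsorted)
print(fixed.indices, fixed.data)
```

After this step, fixed.indices and fixed.data are in ascending index order and could be passed on to the vector constructor.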