Jan Luts created SPARK-8614:
-------------------------------
Summary: Row order preservation for operations on MLlib IndexedRowMatrix
Key: SPARK-8614
URL: https://issues.apache.org/jira/browse/SPARK-8614
Project: Spark
Issue Type: Bug
Components: MLlib
Reporter: Jan Luts
In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply, the row
indices are dropped before the corresponding RowMatrix methods are called. For
example, in IndexedRowMatrix.computeSVD:
val svd = toRowMatrix().computeSVD(k, computeU, rCond)
and for IndexedRowMatrix.multiply:
val mat = toRowMatrix().multiply(B)
After these results are computed, they are zipped with the original indices,
e.g. for IndexedRowMatrix.computeSVD:
val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
  IndexedRow(i, v)
}
and for IndexedRowMatrix.multiply:
val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) =>
  IndexedRow(i, v)
}
I have observed that, for both IndexedRowMatrix.computeSVD().U and
IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply), row
indices can get mixed up when Spark jobs run with multiple
executors/machines: the vectors and indices of the result no longer seem to
correspond.
To me it looks like this is caused by zipping RDDs that have a different
ordering?
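If the zip is indeed the problem, one way to sidestep it would be to keep each index attached to its vector during the multiplication, so that no re-pairing is needed afterwards. The following is only a minimal plain-Scala sketch of that idea, not Spark code: IndexedRow here is a hypothetical stand-in for the MLlib class, B is modelled as a Vector of rows, and multiplyRow and multiplyKeepingIndex are illustrative names that do not exist in MLlib.

```scala
object IndexCarryingMultiply {
  // Hypothetical stand-in for MLlib's IndexedRow, for illustration only.
  final case class IndexedRow(index: Long, vector: Vector[Double])

  // Row vector times a local matrix b (stored as a sequence of rows): v^T * b.
  def multiplyRow(v: Vector[Double], b: Vector[Vector[Double]]): Vector[Double] = {
    require(b.length == v.length, "vector length must equal number of rows of b")
    val nCols = if (b.isEmpty) 0 else b.head.length
    Vector.tabulate(nCols) { j =>
      v.indices.map(i => v(i) * b(i)(j)).sum
    }
  }

  // The index travels with its vector, so the result does not depend on
  // the order in which the rows are processed.
  def multiplyKeepingIndex(rows: Seq[IndexedRow],
                           b: Vector[Vector[Double]]): Seq[IndexedRow] =
    rows.map(r => IndexedRow(r.index, multiplyRow(r.vector, b)))
}
```

Because every output row is built from a single input row, the index-to-vector association holds for any processing order; the same per-row map should carry over to an RDD of indexed rows, though that is an assumption and has not been tested on a cluster.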
For IndexedRowMatrix.multiply, I have observed that the ordering within
partitions is preserved, but that the partitions themselves seem to get mixed
up. For example, for:
part1Index1 part1Vector1
part1Index2 part1Vector2
part2Index1 part2Vector1
part2Index2 part2Vector2
I got:
part2Index1 part1Vector1
part2Index2 part1Vector2
part1Index1 part2Vector1
part1Index2 part2Vector2
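The pattern above can be reproduced with a small plain-Scala model, which may help when reasoning about the cause. This is a sketch under assumptions, not Spark code: partitions are modelled as nested Vectors, zipPartitions is a hypothetical helper that pairs the i-th partition of one side with the i-th partition of the other, and the swapped partition order on the index side is only a guess at the failure mode.

```scala
object ZipMisalignmentDemo {
  // Pair the i-th partition of `left` with the i-th partition of `right`,
  // then pair elements positionally within each partition.
  def zipPartitions[A, B](left: Vector[Vector[A]],
                          right: Vector[Vector[B]]): Vector[(A, B)] =
    left.zip(right).flatMap { case (l, r) => l.zip(r) }

  val indices: Vector[Vector[String]] = Vector(
    Vector("part1Index1", "part1Index2"),
    Vector("part2Index1", "part2Index2")
  )
  val vectors: Vector[Vector[String]] = Vector(
    Vector("part1Vector1", "part1Vector2"),
    Vector("part2Vector1", "part2Vector2")
  )

  // Expected pairing: both sides see their partitions in the same order.
  val aligned: Vector[(String, String)] = zipPartitions(indices, vectors)

  // Assumed failure mode: the index side sees its partitions in swapped order.
  val misaligned: Vector[(String, String)] = zipPartitions(indices.reverse, vectors)
}
```

In this model, misaligned pairs part2Index1 with part1Vector1 and so on, matching the output listed above, while the order inside each partition stays intact.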
Another observation is that the mapPartitions call in RowMatrix.multiply:
val AB = rows.mapPartitions { iter =>
had a "preservesPartitioning = true" argument in version 1.0, but this
argument is no longer there.