[ https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616195#comment-14616195 ]
Yanbo Liang commented on SPARK-8614:
------------------------------------
[~Jan Luts] Could you share a test case that would help others reproduce this
bug? I cannot reproduce it in my environment.
> Row order preservation for operations on MLlib IndexedRowMatrix
> ---------------------------------------------------------------
>
> Key: SPARK-8614
> URL: https://issues.apache.org/jira/browse/SPARK-8614
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Reporter: Jan Luts
>
> In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply, indices are
> dropped before calling the methods from RowMatrix. For example, for
> IndexedRowMatrix.computeSVD:
>
>     val svd = toRowMatrix().computeSVD(k, computeU, rCond)
>
> and for IndexedRowMatrix.multiply:
>
>     val mat = toRowMatrix().multiply(B)
>
> After computing these results, they are zipped with the original indices,
> e.g. for IndexedRowMatrix.computeSVD:
>
>     val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
>       IndexedRow(i, v)
>     }
>
> and for IndexedRowMatrix.multiply:
>
>     val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) =>
>       IndexedRow(i, v)
>     }
>
> I have observed that for IndexedRowMatrix.computeSVD().U and
> IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply), row
> indices can get mixed up when running Spark jobs with multiple
> executors/machines: the vectors and indices of the result no longer seem
> to correspond.
> To me it looks like this is caused by zipping RDDs that have a different
> ordering.
> For IndexedRowMatrix.multiply I have observed that the ordering within
> partitions is preserved, but that it seems to get mixed up between
> partitions. For example, for:
> part1Index1 part1Vector1
> part1Index2 part1Vector2
> part2Index1 part2Vector1
> part2Index2 part2Vector2
> I got:
> part2Index1 part1Vector1
> part2Index2 part1Vector2
> part1Index1 part2Vector1
> part1Index2 part2Vector2
> Another observation is that the mapPartitions in RowMatrix.multiply:
>
>     val AB = rows.mapPartitions { iter =>
>
> had a "preservesPartitioning = true" argument in version 1.0, but this is no
> longer there.
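
For anyone trying to reason about the partition mix-up in the quoted report, here
is a minimal sketch in plain Scala collections (no Spark involved). It only
illustrates the suspected failure mode and is not a reproduction of the bug: the
object name ZipOrderSketch, the sample data, and the transform function are all
hypothetical stand-ins, and the "different partition order" is simply assumed
rather than produced by a real scheduler.

```scala
// Plain Scala collections stand in for RDD partitions; this does not use Spark.
object ZipOrderSketch {
  // Hypothetical data: each "partition" holds (index, vector) pairs.
  val part1: Seq[(Long, Vector[Double])] =
    Seq(0L -> Vector(1.0, 0.0), 1L -> Vector(0.0, 1.0))
  val part2: Seq[(Long, Vector[Double])] =
    Seq(2L -> Vector(2.0, 0.0), 3L -> Vector(0.0, 2.0))

  // Hypothetical stand-in for the per-row computation (e.g. a multiply).
  def transform(v: Vector[Double]): Vector[Double] = v.map(_ * 10.0)

  // Positional pairing: indices and results are materialised separately.
  // ASSUMPTION: the two sides see the partitions in different orders.
  val indices: Seq[Long] = (part2 ++ part1).map(_._1)
  val results: Seq[Vector[Double]] =
    (part1 ++ part2).map { case (_, v) => transform(v) }
  val zipped: Seq[(Long, Vector[Double])] = indices.zip(results)

  // Keyed pairing: the index travels with its row through the computation,
  // so partition order no longer matters.
  val keyed: Seq[(Long, Vector[Double])] =
    (part1 ++ part2).map { case (i, v) => (i, transform(v)) }

  def main(args: Array[String]): Unit = {
    println("zipped by position:")
    zipped.foreach(println)
    println("index kept with row:")
    keyed.foreach(println)
  }
}
```

In the positional variant the indices from part2 end up attached to the vectors
from part1, which matches the pattern shown in the report. Keeping the index
next to its vector inside a single map avoids relying on two collections having
the same ordering; whether that is the right change for IndexedRowMatrix itself
is a separate question.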