Github user sethah commented on a diff in the pull request:
https://github.com/apache/spark/pull/17793#discussion_r114003560
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
---
@@ -910,26 +944,127 @@ object ALS extends DefaultParamsReadable[ALS] with
Logging {
private type FactorBlock = Array[Array[Float]]
/**
- * Out-link block that stores, for each dst (item/user) block, which src
(user/item) factors to
- * send. For example, outLinkBlock(0) contains the local indices (not
the original src IDs) of the
- * src factors in this block to send to dst block 0.
+ * Out-link blocks that store information about which columns of the
items factor matrix are
+ * required to calculate which rows of the users factor matrix, and vice
versa.
+ *
+ * Specifically, when calculating a user factor vector, since only those
columns of the items
+ * factor matrix that correspond to the items that that user has rated
are needed, we can avoid
+ * having to repeatedly copy the entire items factor matrix to each
worker later in the algorithm
+ * by precomputing these dependencies for all users, storing them in an
RDD of `OutBlock`s. The
+ * items' dependencies on the columns of the users factor matrix is
computed similarly.
*/
private type OutBlock = Array[Array[Int]]
/**
- * In-link block for computing src (user/item) factors. This includes
the original src IDs
- * of the elements within this block as well as encoded dst (item/user)
indices and corresponding
- * ratings. The dst indices are in the form of (blockId, localIndex),
which are not the original
- * dst IDs. To compute src factors, we expect receiving dst factors that
match the dst indices.
- * For example, if we have an in-link record
+ * In-link block for computing user and item factor matrices.
*
- * {srcId: 0, dstBlockId: 2, dstLocalIndex: 3, rating: 5.0},
+ * The ALS algorithm partitions the columns of the users factor matrix
evenly among Spark workers.
+ * Since each column of the factor matrix is calculated using the known
ratings of the correspond-
+ * ing user, and since the ratings don't change across iterations, the
ALS algorithm preshuffles
+ * the ratings to the appropriate partitions, storing them in `InBlock`
objects.
*
- * and assume that the dst factors are stored as dstFactors: Map[Int,
Array[Array[Float]]], which
- * is a blockId to dst factors map, the corresponding dst factor of the
record is dstFactor(2)(3).
+ * The ratings shuffled by item ID are computed similarly and also
stored in `InBlock` objects.
+ * Note that this means every rating is stored twice, once as shuffled
by user ID and once by item
+ * ID. This is a necessary tradeoff, since in general a rating will not
be on the same worker
+ * when partitioned by user as by item.
*
- * We use a CSC-like (compressed sparse column) format to store the
in-link information. So we can
- * compute src factors one after another using only one normal equation
instance.
+ * =Example=
+ *
+ * Say we have a small collection of eight items to offer the seven
users in our application. We
+ * have some known ratings given by the users, as seen in the matrix
below:
+ *
+ * {{{
+ * Items
+ * 0 1 2 3 4 5 6 7
+ * +---+---+---+---+---+---+---+---+
+ * 0 | |0.1| | |0.4| | |0.7|
+ * +---+---+---+---+---+---+---+---+
+ * 1 | | | | | | | | |
+ * +---+---+---+---+---+---+---+---+
+ * U 2 | | | | | | | | |
+ * s +---+---+---+---+---+---+---+---+
+ * e 3 | |3.1| | |3.4| | |3.7|
+ * r +---+---+---+---+---+---+---+---+
+ * s 4 | | | | | | | | |
+ * +---+---+---+---+---+---+---+---+
+ * 5 | | | | | | | | |
+ * +---+---+---+---+---+---+---+---+
+ * 6 | |6.1| | |6.4| | |6.7|
+ * +---+---+---+---+---+---+---+---+
+ * }}}
+ *
+ * The ratings are represented as an RDD, passed to the
`partitionRatings` method as the `ratings`
+ * parameter:
+ *
+ * {{{
+ * ratings.collect() == Seq(
+ * Rating(0, 1, 0.1f),
+ * Rating(0, 4, 0.4f),
+ * Rating(0, 7, 0.7f),
+ * Rating(3, 1, 3.1f),
+ * Rating(3, 4, 3.4f),
+ * Rating(3, 7, 3.7f),
+ * Rating(6, 1, 6.1f),
+ * Rating(6, 4, 6.4f),
+ * Rating(6, 7, 6.7f)
+ * )
+ * }}}
+ *
+ * (In this contrived example, the rating values are chosen specifically
for clarity and are in
--- End diff --
This part seems unnecessary. Definitely the last sentence.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]