Github user danielyli commented on a diff in the pull request:
https://github.com/apache/spark/pull/17793#discussion_r115114733
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
---
@@ -910,26 +944,143 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
private type FactorBlock = Array[Array[Float]]
/**
- * Out-link block that stores, for each dst (item/user) block, which src (user/item) factors to
- * send. For example, outLinkBlock(0) contains the local indices (not the original src IDs) of the
- * src factors in this block to send to dst block 0.
+ * A mapping of the columns of the items factor matrix that are needed when calculating each row
+ * of the users factor matrix, and vice versa.
+ *
+ * Specifically, when calculating a user factor vector, since only those columns of the items
+ * factor matrix that correspond to the items that that user has rated are needed, we can avoid
+ * having to repeatedly copy the entire items factor matrix to each worker later in the algorithm
+ * by precomputing these dependencies for all users, storing them in an RDD of `OutBlock`s. The
+ * items' dependencies on the columns of the users factor matrix is computed similarly.
+ *
+ * =Example=
+ *
+ * Using the example provided in the `InBlock` Scaladoc, `userOutBlocks` would look like the
+ * following:
+ *
+ * {{{ userOutBlocks.collect() == Seq(
+ * 0 -> Array(Array(0, 1), Array(0, 1)),
+ * 1 -> Array(Array(0), Array(0))) }}}
+ *
+ * The data structure encodes the following information:
--- End diff --
Updated, though I still don't like it very much. Honestly, reading either
of our versions would make my head spin if I weren't already acquainted with
the encoding; I'd still have to dive into the actual code and work out an
example for myself before I'd feel familiar with it. Should we just leave it
as-is?
Alternatively, if you feel you can write it more clearly, please don't hesitate
to change the PR directly. (If you do update, note that the user IDs are not
random but are sorted in ascending order within each partition.)
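
For anyone working through the encoding by hand, here is a minimal sketch that
decodes the `userOutBlocks` value quoted in the diff above. It is only an
illustration, not code from the PR: `OutBlockSketch` and `describe` are
hypothetical names, and it assumes (following the original Scaladoc) that the
key of each pair is the user block ID, that the outer array index is the
destination item block, and that the inner array holds local indices of the
user factors to send there.

```scala
object OutBlockSketch {
  // Same shape as the out-block in the diff: outer index = destination block,
  // inner array = local indices (not original IDs) of the source factors to send.
  type OutBlock = Array[Array[Int]]

  // The example value from the Scaladoc, written as a local Seq instead of an RDD.
  val userOutBlocks: Seq[(Int, OutBlock)] = Seq(
    0 -> Array(Array(0, 1), Array(0, 1)),
    1 -> Array(Array(0), Array(0)))

  // Hypothetical helper: flatten the encoding into
  // (srcBlockId, dstBlockId, localIndices) triples.
  def describe(blocks: Seq[(Int, OutBlock)]): Seq[(Int, Int, Seq[Int])] =
    for {
      (srcBlockId, outBlock) <- blocks
      (localIndices, dstBlockId) <- outBlock.toSeq.zipWithIndex
    } yield (srcBlockId, dstBlockId, localIndices.toSeq)

  def main(args: Array[String]): Unit =
    describe(userOutBlocks).foreach { case (src, dst, idx) =>
      val shown = idx.mkString("[", ", ", "]")
      println(s"user block $src sends local factors $shown to item block $dst")
    }
}
```

Under those assumptions, user block 0 sends its local factors 0 and 1 to both
item blocks, while user block 1 sends only its local factor 0 to each.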