[GitHub] spark pull request #17793: [SPARK-20484][MLLIB] Add documentation to ALS cod...

danielyli Fri, 28 Apr 2017 17:00:07 -0700

Github user danielyli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17793#discussion_r114038705
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala 
---
    @@ -910,26 +944,127 @@ object ALS extends DefaultParamsReadable[ALS] with 
Logging {
       private type FactorBlock = Array[Array[Float]]
     
       /**
    -   * Out-link block that stores, for each dst (item/user) block, which src 
(user/item) factors to
    -   * send. For example, outLinkBlock(0) contains the local indices (not 
the original src IDs) of the
    -   * src factors in this block to send to dst block 0.
    +   * Out-link blocks that store information about which columns of the 
items factor matrix are
    +   * required to calculate which rows of the users factor matrix, and vice 
versa.
    +   *
    +   * Specifically, when calculating a user factor vector, since only those 
columns of the items
    +   * factor matrix that correspond to the items that that user has rated 
are needed, we can avoid
    +   * having to repeatedly copy the entire items factor matrix to each 
worker later in the algorithm
    +   * by precomputing these dependencies for all users, storing them in an 
RDD of `OutBlock`s.  The
    +   * items' dependencies on the columns of the users factor matrix is 
computed similarly.
        */
       private type OutBlock = Array[Array[Int]]
     
       /**
    -   * In-link block for computing src (user/item) factors. This includes 
the original src IDs
    -   * of the elements within this block as well as encoded dst (item/user) 
indices and corresponding
    -   * ratings. The dst indices are in the form of (blockId, localIndex), 
which are not the original
    -   * dst IDs. To compute src factors, we expect receiving dst factors that 
match the dst indices.
    -   * For example, if we have an in-link record
    +   * In-link block for computing user and item factor matrices.
        *
    -   * {srcId: 0, dstBlockId: 2, dstLocalIndex: 3, rating: 5.0},
    +   * The ALS algorithm partitions the columns of the users factor matrix 
evenly among Spark workers.
    +   * Since each column of the factor matrix is calculated using the known 
ratings of the correspond-
    +   * ing user, and since the ratings don't change across iterations, the 
ALS algorithm preshuffles
    +   * the ratings to the appropriate partitions, storing them in `InBlock` 
objects.
        *
    -   * and assume that the dst factors are stored as dstFactors: Map[Int, 
Array[Array[Float]]], which
    -   * is a blockId to dst factors map, the corresponding dst factor of the 
record is dstFactor(2)(3).
    +   * The ratings shuffled by item ID are computed similarly and also 
stored in `InBlock` objects.
    +   * Note that this means every rating is stored twice, once as shuffled 
by user ID and once by item
    +   * ID.  This is a necessary tradeoff, since in general a rating will not 
be on the same worker
    +   * when partitioned by user as by item.
        *
    -   * We use a CSC-like (compressed sparse column) format to store the 
in-link information. So we can
    -   * compute src factors one after another using only one normal equation 
instance.
    +   * =Example=
    +   *
    +   * Say we have a small collection of eight items to offer the seven 
users in our application.  We
    +   * have some known ratings given by the users, as seen in the matrix 
below:
    +   *
    +   * {{{
    +   *                       Items
    +   *            0   1   2   3   4   5   6   7
    +   *          +---+---+---+---+---+---+---+---+
    +   *        0 |   |0.1|   |   |0.4|   |   |0.7|
    +   *          +---+---+---+---+---+---+---+---+
    +   *        1 |   |   |   |   |   |   |   |   |
    +   *          +---+---+---+---+---+---+---+---+
    +   *     U  2 |   |   |   |   |   |   |   |   |
    +   *     s    +---+---+---+---+---+---+---+---+
    +   *     e  3 |   |3.1|   |   |3.4|   |   |3.7|
    +   *     r    +---+---+---+---+---+---+---+---+
    +   *     s  4 |   |   |   |   |   |   |   |   |
    +   *          +---+---+---+---+---+---+---+---+
    +   *        5 |   |   |   |   |   |   |   |   |
    +   *          +---+---+---+---+---+---+---+---+
    +   *        6 |   |6.1|   |   |6.4|   |   |6.7|
    +   *          +---+---+---+---+---+---+---+---+
    +   * }}}
    +   *
    +   * The ratings are represented as an RDD, passed to the 
`partitionRatings` method as the `ratings`
    +   * parameter:
    +   *
    +   * {{{
    +   *     ratings.collect() == Seq(
    +   *       Rating(0, 1, 0.1f),
    +   *       Rating(0, 4, 0.4f),
    +   *       Rating(0, 7, 0.7f),
    +   *       Rating(3, 1, 3.1f),
    +   *       Rating(3, 4, 3.4f),
    +   *       Rating(3, 7, 3.7f),
    +   *       Rating(6, 1, 6.1f),
    +   *       Rating(6, 4, 6.4f),
    +   *       Rating(6, 7, 6.7f)
    +   *     )
    +   * }}}
    +   *
    +   * (In this contrived example, the rating values are chosen specifically 
for clarity and are in
    --- End diff --
    
    Actually, on second thought, the first clause of the first sentence 
clarifies why, if ratings are usually whole numbers, we're using floats; the 
first sentence justifies the second sentence.  I would err on keeping the whole 
thing in as is.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #17793: [SPARK-20484][MLLIB] Add documentation to ALS cod...

Reply via email to