GitHub user danielyli opened a pull request:
https://github.com/apache/spark/pull/17767
Als refactor
## What changes were proposed in this pull request?
This is a non-feature-changing refactoring of the ALS code (specifically,
the `org.apache.spark.ml.recommendation` package), done to improve code
maintainability and to add significant documentation to the existing code. My
motivation for this PR is that I've been working on an online streaming ALS
implementation [[SPARK-6407](https://issues.apache.org/jira/browse/SPARK-6407)]
(PR coming soon), and I've been refactoring the package to help me understand
the existing code before adding to it. I've also tried my best to include a
fair bit of Scaladocs and inline comments where I felt they would have helped
when I was reading the code.
I've done a fair bit of rebasing and sausage making to make the commits
easy to follow, since no one likes to stare at a 2,700-line PR. Please let me
know if I can make anything clearer. I'd be happy to answer any questions.
In a few places, you'll find a `PLEASE_ADVISE(danielyli):` tag in the code.
These are questions I had in the course of the refactoring. I'd appreciate if
the relevant folks could help me with these. Thanks.
## How was this patch tested?
As this is a non-feature-changing refactoring, existing tests were used.
All existing ALS tests pass.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/danielyli/spark als-refactor
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17767.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17767
----
commit deca4db3f234ea60c1494265d4f3ac9375869dd6
Author: Daniel Li <[email protected]>
Date: 2017-04-05T21:55:21Z
Split `ALS.scala` into multiple files
This commit moves the classes `ALS` and `ALSModel` and the traits
`ALSParams` and `ALSModelParams` into their own files.
commit 4086bc9d0c7689e0d2047ac17ada29fe236eb6e6
Author: Daniel Li <[email protected]>
Date: 2017-04-05T22:20:32Z
Move solver classes into their own file
This commit puts the classes `LeastSquaresNESolver`, `CholeskySolver`,
`NNLSSolver`, and `NormalEquation` into a mixin in a separate file in
order to reduce the size and improve the readability of `ALS.scala`.
commit 8aaa533df6f3c9a4b4e8c5d5023f831daf06fa9e
Author: Daniel Li <[email protected]>
Date: 2017-04-05T22:30:38Z
Minor cleanup of imports
* import java.util.Arrays
* import scala.collection.mutable.ArrayBuilder
commit b68680025e71ebd422087ef95d5ecb7af40fa26d
Author: Daniel Li <[email protected]>
Date: 2017-04-05T22:48:50Z
Create a package object to hold small type and class definitions
commit 83f849ee45fd7c80a1a50fcf12da1eb99d8b6346
Author: Daniel Li <[email protected]>
Date: 2017-04-06T02:17:14Z
Refactor `RatingBlock`-related code
This commit moves the following classes and methods into new files,
separating and encapsulating them as appropriate:
* RatingBlock
* RatingBlockBuilder
* UncompressedInBlock
* UncompressedInBlockBuilder
* KeyWrapper
* UncompressedInBlockSort
* LocalIndexEncoder
* partitionRatings
* makeBlocks
In the course of this refactoring we create a new class, `RatingBlocks`,
to hold the user/item in/out block data and associated logic.
commit 819a00f7fe7384e588ce78cb65e9413ac6588401
Author: Daniel Li <[email protected]>
Date: 2017-04-06T07:08:43Z
Pull out `RatingBlock` from `RatingBlocks` into its own file
This commit puts the `RatingBlock` class into a mixin for the
`RatingBlocks` companion object to extend. This is done purely to
increase readability by reducing the file size of `RatingBlocks.scala`.
commit 56d10ba1fa627f343e67525e2a3b08e7287bfe2f
Author: Daniel Li <[email protected]>
Date: 2017-04-06T08:50:06Z
Tighten access modifiers where appropriate and make case classes `final`
commit b861d18784ba4ce688d3eacaea10169c9ce2d091
Author: Daniel Li <[email protected]>
Date: 2017-04-06T09:38:54Z
Improve code hygiene of `RatingBlocks`
Among other things, `while` loops that used manually incremented
counters have been changed to `for` loops to increase readability.
Performance should be nominally affected.
commit 5dfee79a1280d0a72bbe7b8596cdf86654fa0fbc
Author: Daniel Li <[email protected]>
Date: 2017-04-06T09:57:11Z
Spruce up `ALS#fit`
This commit adds vertical whitespace to improve readability.
commit 056d6d0ecc962f94c83f43a6384607bf8833d083
Author: Daniel Li <[email protected]>
Date: 2017-04-25T23:31:54Z
Mark `RatingBlocks` constructor as `private`
commit 34df11247ec3fdcf29e220bedcdf28b58d1ac4ec
Author: Daniel Li <[email protected]>
Date: 2017-04-25T23:32:44Z
Add Scaladocs to `RatingBlocks.scala`
commit 31b0dcd843d8edc16c6d2bf982d3de753b6dc066
Author: Daniel Li <[email protected]>
Date: 2017-04-06T09:59:24Z
Spruce up `ALS.train` and `ALS.initFactors`
* Add descriptive Scaladoc giving high-level overview of `ALS.train`
algorithm
* Refactor out duplicated code into variables
* Add comments to clarify code where appropriate
* Change `while` loops that used manually incremented counters to
`for` loops to increase readability. Performance should be
nominally affected.
* Clarify names of variables where appropriate
* Add vertical whitespace to improve readability
commit 42243fef420de6d7db28194f0dc804284182dc64
Author: Daniel Li <[email protected]>
Date: 2017-04-06T10:11:46Z
Spruce up `ALS.computeFactors`
* Refactor out duplicated code into variables
* Change `while` loops that used manually incremented counters to
`for` loops to increase readability. Performance should be
nominally affected.
* Clarify names of variables where appropriate
* Add vertical whitespace to improve readability
commit ac01f7c19513a35d838df1db11c91a0db6abb00f
Author: Daniel Li <[email protected]>
Date: 2017-04-25T08:02:40Z
Update Scaladoc for `ALS` class
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]