GitHub user staple opened a pull request:
https://github.com/apache/spark/pull/2347
[SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached
data.
Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when
called with input data that is not cached. KMeans is implemented iteratively,
and I believe that GeneralizedLinearAlgorithm's current optimizers are
iterative and its future optimizers are also likely to be iterative.
RowMatrix's computeSVD is iterative against an RDD when run in DistARPACK
mode. ALS and DecisionTree are iterative as well, but they implement RDD
caching internally so do not require a warning.
I added a warning to GeneralizedLinearAlgorithm rather than inside its
optimizers themselves, where the iteration actually occurs, because internally
GeneralizedLinearAlgorithm maps its input data to an uncached RDD before
passing it to an optimizer. (In other words, the warning would be printed for
every GeneralizedLinearAlgorithm run, regardless of whether its input is
cached, if the warning were in GradientDescent or another optimizer.) I assume
that use of an uncached RDD by GeneralizedLinearAlgorithm is intentional, and
that the mapping there (adding label, intercepts and scaling) is a lightweight
operation. Arguably a user calling an optimizer such as GradientDescent will be
knowledgeable enough to cache their data without needing a log warning, so the
lack of a warning in the optimizers may be OK.
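The placement issue described above can be illustrated with a toy model in plain Python. This is only a sketch: ToyRDD, warn_if_uncached, and train are hypothetical stand-ins, not Spark classes. The one assumption carried over is that mapping a dataset yields a new dataset that is not cached, whatever the storage state of its parent.

```python
import logging

logger = logging.getLogger("toy_mllib")


class ToyRDD:
    """Hypothetical stand-in for an RDD: holds items and a cached flag."""

    def __init__(self, items, cached=False):
        self.items = list(items)
        self.cached = cached

    def cache(self):
        self.cached = True
        return self

    def map(self, fn):
        # A mapped dataset is a new, uncached dataset, even if its parent is cached.
        return ToyRDD((fn(x) for x in self.items), cached=False)


def warn_if_uncached(data, where):
    """Log a warning and return True if `data` is not cached."""
    if not data.cached:
        logger.warning("%s: input data is not cached; performance may suffer.", where)
        return True
    return False


def train(data):
    """Toy algorithm: check the caller's input, then map it for the optimizer."""
    warned_outer = warn_if_uncached(data, "train")
    # Stands in for the lightweight mapping (label, intercept, scaling).
    prepared = data.map(lambda x: (x, 1.0))
    # A check placed inside the optimizer would see `prepared`, never `data`.
    would_warn_inner = not prepared.cached
    return warned_outer, would_warn_inner
```

In this model a cached input produces no warning at the outer check, while a check against the mapped data would fire regardless of what the caller did, which is why the warning sits at the algorithm level.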
This patch causes all calls to GeneralizedLinearAlgorithm from Python to
print a warning, because the implementation in
PythonMLLibAPI.trainRegressionModel deserializes the data from Python using
map(SerDe.deserializeLabeledPoint) to create a deserialized RDD without caching
this new RDD. This means that deserialization must occur on every training
iteration for RDDs originating in Python. Perhaps the Python cache() call from
_regression_train_wrapper / _get_unmangled_labeled_point_rdd should be moved to
be after deserialization instead of before serialization. There is a similar
issue in KMeans.
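To make the cost concrete, here is a rough plain-Python sketch that counts how often a deserialization step runs over several training iterations, depending on where the cache sits. LazyDataset, deserialize, and run_iterations are hypothetical names for illustration. The sketch models lazy recomputation in general and is not the PythonMLLibAPI code path.

```python
class LazyDataset:
    """Hypothetical lazy dataset: recomputes from its parent unless cached."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source
        self._parent = parent
        self._fn = fn
        self._cache_enabled = False
        self._materialized = None

    def map(self, fn):
        # Mapping is lazy: nothing is computed until collect() is called.
        return LazyDataset(parent=self, fn=fn)

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        if self._materialized is not None:
            return self._materialized
        if self._parent is None:
            result = list(self._source)
        else:
            result = [self._fn(x) for x in self._parent.collect()]
        if self._cache_enabled:
            self._materialized = result
        return result


def run_iterations(cache_after_deserialize, iterations=5):
    """Return how many times the deserializer ran across all iterations."""
    calls = {"n": 0}

    def deserialize(raw):
        calls["n"] += 1
        return float(raw)

    serialized = LazyDataset(source=["1.0", "2.0", "3.0"])
    if cache_after_deserialize:
        training = serialized.map(deserialize).cache()
    else:
        training = serialized.cache().map(deserialize)
    for _ in range(iterations):
        sum(training.collect())  # each iteration makes one pass over the data
    return calls["n"]
```

Under these assumptions, run_iterations(False) runs the deserializer 15 times (3 records over 5 iterations), while run_iterations(True) runs it only 3 times, once per record.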
Some of the documentation examples making use of these iterative algorithms
did not cache their training RDDs (while others did). I updated the examples to
always cache. I also fixed some (unrelated) minor errors in the documentation
examples.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/staple/spark SPARK-1484
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2347.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2347
----
commit 7b31102b3ad68e821a21a31ab3e49fe069c98e9e
Author: Aaron Staple <[email protected]>
Date: 2014-09-10T14:18:17Z
Minor doc example fixes.
commit bc90b68094c32678aa41fd65756105f9d3dd414b
Author: Aaron Staple <[email protected]>
Date: 2014-09-10T14:19:58Z
[SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached
data.
----