GitHub user staple opened a pull request:

    https://github.com/apache/spark/pull/2347

    [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached 
data.

    Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when 
called with input data that is not cached. KMeans is implemented iteratively, 
and I believe that GeneralizedLinearAlgorithm’s current optimizers are 
iterative and its future optimizers are also likely to be iterative. 
RowMatrix’s computeSVD is iterative against an RDD when run in DistARPACK 
mode. ALS and DecisionTree are iterative as well, but they implement RDD 
caching internally, so they do not require a warning.
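
    To illustrate why the warning matters, here is a minimal toy model in plain Python. It is not Spark code and not the patch itself: ToyDataset, iterative_mean, and the log message text are hypothetical names invented for this sketch. It assumes only that an uncached lazy dataset is recomputed on every pass, which is the behavior the warning is meant to flag.

```python
import logging

logger = logging.getLogger("toy_mllib")


class ToyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's RDD)."""

    def __init__(self, source):
        self._source = source    # zero-argument callable that produces the records
        self._cached = None      # materialized records, filled in after cache()
        self.is_cached = False   # whether caching has been requested
        self.computations = 0    # how many times the source was evaluated

    def cache(self):
        """Request caching; records are stored on the next evaluation."""
        self.is_cached = True
        return self

    def collect(self):
        """Return the records, recomputing them unless a cached copy exists."""
        if self._cached is not None:
            return self._cached
        self.computations += 1
        records = list(self._source())
        if self.is_cached:
            self._cached = records
        return records


def iterative_mean(data, iterations=5):
    """Toy iterative algorithm: scans the whole dataset once per iteration."""
    if not data.is_cached:
        # Analogous in spirit to the warning this pull request adds.
        logger.warning("The input data is not cached; each iteration will recompute it.")
    estimate = 0.0
    for _ in range(iterations):
        records = data.collect()
        target = sum(records) / len(records)
        estimate += 0.5 * (target - estimate)  # move halfway toward the mean
    return estimate
```

    With five iterations, an uncached ToyDataset evaluates its source five times and logs the warning, while a cached one evaluates it once and stays silent.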
    
    I added a warning to GeneralizedLinearAlgorithm rather than inside its 
optimizers themselves, where the iteration actually occurs, because internally 
GeneralizedLinearAlgorithm maps its input data to an uncached RDD before 
passing it to an optimizer. (In other words, if the warning were in 
GradientDescent or another optimizer, it would be printed for every 
GeneralizedLinearAlgorithm run, regardless of whether its input is cached.) I 
assume that use of an uncached RDD by GeneralizedLinearAlgorithm is 
intentional, and that the mapping there (adding label, intercepts and scaling) 
is a lightweight operation. Arguably a user calling an optimizer such as 
GradientDescent directly will be knowledgeable enough to cache their data 
without needing a log warning, so the lack of a warning in the optimizers may 
be OK.
    
    This patch causes all calls to GeneralizedLinearAlgorithm from Python to 
print a warning, because the implementation in 
PythonMLLibAPI.trainRegressionModel deserializes the data from Python using 
map(SerDe.deserializeLabeledPoint) to create a deserialized RDD without caching 
this new RDD. This means that deserialization must occur on every training 
iteration for RDDs originating in Python. Perhaps the Python cache() call from 
_regression_train_wrapper / _get_unmangled_labeled_point_rdd should be moved so 
that it happens after deserialization instead of before serialization. There is 
a similar 
issue in KMeans.
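
    The effect of where cache() sits relative to the deserializing map can be sketched with another toy model in plain Python. Again, this is not Spark or PySpark code: ToyDataset, count_deserializations, and the string-to-float "deserialization" are hypothetical stand-ins, and the sketch assumes only that a mapped, uncached dataset re-applies its map function on every pass.

```python
class ToyDataset:
    """Toy lazily evaluated dataset with map() and cache() (hypothetical, not an RDD)."""

    def __init__(self, compute):
        self._compute = compute          # zero-argument callable returning a list
        self._cached = None              # materialized records, if caching was requested
        self._cache_requested = False

    def map(self, fn):
        """Return a new, uncached dataset that applies fn lazily to this one."""
        parent = self
        return ToyDataset(lambda: [fn(x) for x in parent.collect()])

    def cache(self):
        self._cache_requested = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        records = self._compute()
        if self._cache_requested:
            self._cached = records
        return records


def count_deserializations(cache_after_map, iterations=10):
    """Count how often records are deserialized over a number of training passes."""
    calls = [0]  # mutable counter shared with the closure below

    def deserialize(raw):
        calls[0] += 1
        return float(raw)

    serialized = ToyDataset(lambda: ["1.0", "2.0", "3.0"])
    if cache_after_map:
        # Cache the deserialized data: deserialization runs once.
        training = serialized.map(deserialize).cache()
    else:
        # Cache only the serialized data: the map is re-run on every pass.
        training = serialized.cache().map(deserialize)

    for _ in range(iterations):
        training.collect()  # one full scan per "training iteration"
    return calls[0]
```

    For three records and ten passes, caching before the map yields 30 deserialization calls, while caching after the map yields 3.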
    
    Some of the documentation examples making use of these iterative algorithms 
did not cache their training RDDs (while others did). I updated the examples to 
always cache. I also fixed some (unrelated) minor errors in the documentation 
examples.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/staple/spark SPARK-1484

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2347.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2347
    
----
commit 7b31102b3ad68e821a21a31ab3e49fe069c98e9e
Author: Aaron Staple <[email protected]>
Date:   2014-09-10T14:18:17Z

    Minor doc example fixes.

commit bc90b68094c32678aa41fd65756105f9d3dd414b
Author: Aaron Staple <[email protected]>
Date:   2014-09-10T14:19:58Z

    [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached 
data.

----

