[GitHub] spark pull request: [SPARK-9671] [MLLIB] re-org user guide and add...

feynmanliang Fri, 28 Aug 2015 12:07:38 -0700

Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8498#discussion_r38233348
  
    --- Diff: docs/mllib-guide.md ---
    @@ -56,71 +63,63 @@ This lists functionality included in `spark.mllib`, the main MLlib API.
       * [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs)
     * [PMML model export](mllib-pmml-model-export.html)
     
    -MLlib is under active development.
    -The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
    -and the migration guide below will explain all changes between releases.
    -
     # spark.ml: high-level APIs for ML pipelines
     
    -Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
    -high-level APIs that help users create and tune practical machine learning pipelines.
    -
    -*Graduated from Alpha!*  The Pipelines API is no longer an alpha component, although many elements of it are still `Experimental` or `DeveloperApi`.
    -
    -Note that we will keep supporting and adding features to `spark.mllib` along with the
    -development of `spark.ml`.
    -Users should be comfortable using `spark.mllib` features and expect more features coming.
    -Developers should contribute new algorithms to `spark.mllib` and can optionally contribute
    -to `spark.ml`.
    -
    -Guides for `spark.ml` include:
    +**[spark.ml programming guide](ml-guide.html)** provides an overview of the Pipelines API and major
    +concepts. It also contains sections on using algorithms within the Pipelines API, for example:
     
    -* **[spark.ml programming guide](ml-guide.html)**: overview of the Pipelines API and major concepts
    -* Guides on using algorithms within the Pipelines API:
    -  * [Feature transformers](ml-features.html), including a few not in the lower-level `spark.mllib` API
    -  * [Decision trees](ml-decision-tree.html)
    -  * [Ensembles](ml-ensembles.html)
    -  * [Linear methods](ml-linear-methods.html)
    +* [Feature extractors and transformers](ml-features.html)
    +* [Linear methods](ml-linear-methods.html)
    +* [Decision trees](ml-decision-tree.html)
    +* [Ensembles](ml-ensembles.html)
    +* [Artificial neural network](ml-ann.html)
     
     # Dependencies
     
    -MLlib uses the linear algebra package
    -[Breeze](http://www.scalanlp.org/), which depends on
    -[netlib-java](https://github.com/fommil/netlib-java) for optimised
    -numerical processing. If natives are not available at runtime, you
    -will see a warning message and a pure JVM implementation will be used
    -instead.
    +MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
    +[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
    +If natives libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
    +implementation will be used instead.
     
    -To learn more about the benefits and background of system optimised
    -natives, you may wish to watch Sam Halliday's ScalaX talk on
    -[High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/)).
    +Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
    +proxies by default.
    +To configure `netlib-java` / Breeze to use system optimised binaries, include
    +`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
    +project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
    +platform's additional installation instructions.
     
    -Due to licensing issues with runtime proprietary binaries, we do not
    -include `netlib-java`'s native proxies by default. To configure
    -`netlib-java` / Breeze to use system optimised binaries, include
    -`com.github.fommil.netlib:all:1.1.2` (or build Spark with
    -`-Pnetlib-lgpl`) as a dependency of your project and read the
    -[netlib-java](https://github.com/fommil/netlib-java) documentation for
    -your platform's additional installation instructions.
    +To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
     
    -To use MLlib in Python, you will need [NumPy](http://www.numpy.org)
    -version 1.4 or newer.
    +[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
    +    watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
     
    ----
    +# Migration guide
     
    -# Migration Guide
    +MLlib is under active development.
    +The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
    +and the migration guide below will explain all changes between releases.
    +
    +## From 1.4 to 1.5
     
    -For the `spark.ml` package, please see the [spark.ml Migration Guide](ml-guide.html#migration-guide).
    +In the `spark.mllib` package, there are no break API changes but several behavior changes:
     
    -## From 1.3 to 1.4
    +* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005):
    +  `RegressionMetrics.explainedVariance` returns the average regression sum of squares.
    +* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` become
    +  sorted.
    +* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default
    +  convergence tolerance `1e-3`, and hence iterations might end earlier than 1.4.
     
    -In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
    +In the `spark.ml` package, there exists one break API change and one behavior change:
     
    -* Gradient-Boosted Trees
    -    * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed.  This is only an issues for users who wrote their own losses for GBTs.
    -    * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields.  This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
    -* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed.  It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`.  The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
    +* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed
    +  from `Params.setDefault` due to a
    +  [Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013).
    +* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is
    +  added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4.
     
    -## Previous Spark Versions
    +## Previous Spark versions
     
     Earlier migration guides are archived [on this page](mllib-migration-guides.html).
    +
    +---
    --- End diff --
    
    Ditto on divider




