Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/8517#discussion_r38269010
--- Diff: docs/ml-guide.md ---
@@ -24,61 +24,74 @@ title: Spark ML Programming Guide
The `spark.ml` package aims to provide a uniform set of high-level APIs built on top of
[DataFrames](sql-programming-guide.html#dataframes) that help users create and tune practical
machine learning pipelines.
-See the [Algorithm Guides section](#algorithm-guides) below for guides on sub-packages of
+See the [algorithm guides](#algorithm-guides) section below for guides on sub-packages of
`spark.ml`, including feature transformers unique to the Pipelines API, ensembles, and more.
-**Table of Contents**
+**Table of contents**
* This will become a table of contents (this text will be scraped).
{:toc}
-# Main Concepts
+# Main concepts
-Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Spark ML API.
+Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple
+algorithms into a single pipeline, or workflow.
+This section covers the key concepts introduced by the Spark ML API, where the pipeline concept is
+mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
-* **[ML Dataset](ml-guide.html#ml-dataset)**: Spark ML uses the [`DataFrame`](api/scala/index.html#org.apache.spark.sql.DataFrame) from Spark SQL as a dataset which can hold a variety of data types.
-E.g., a dataset could have different columns storing text, feature vectors, true labels, and predictions.
+* **[`DataFrame`](ml-guide.html#dataframe)**: Spark ML uses `DataFrame` from Spark SQL as an ML
+  dataset, which can hold a variety of data types.
+  E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.
* **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
-E.g., an ML model is a `Transformer` which transforms an RDD with features into an RDD with predictions.
+E.g., an ML model is a `Transformer` which transforms `DataFrame` with features into a `DataFrame` with predictions.
* **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
-E.g., a learning algorithm is an `Estimator` which trains on a dataset and produces a model.
+E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.
* **[`Pipeline`](ml-guide.html#pipeline)**: A `Pipeline` chains multiple `Transformer`s and `Estimator`s together to specify an ML workflow.
-* **[`Param`](ml-guide.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.
+* **[`Parameter`](ml-guide.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.
-## ML Dataset
+## DataFrame
Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
-Spark ML adopts the [`DataFrame`](api/scala/index.html#org.apache.spark.sql.DataFrame) from Spark SQL in order to support a variety of data types under a unified Dataset concept.
+Spark ML adopts the `DataFrame` from Spark SQL in order to support a variety of data types.
--- End diff --
I thought about this but couldn't come up with a good solution. Using `spark.ml` everywhere is accurate, but it makes the guide a bit awkward to read. Another option is to define `Spark ML` precisely somewhere in the doc. Let me think about this and open a new PR if necessary.
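
To ground the `DataFrame` / `Transformer` / `Estimator` / `Pipeline` terminology in the diff above, here is a minimal Scala sketch against the `spark.ml` API; the toy text/label data, the column names, and the pre-existing `sqlContext` are assumptions made purely for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A DataFrame is the ML dataset: here, columns for an id, raw text, and a label.
// (Assumes an existing SQLContext named `sqlContext`.)
val training = sqlContext.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Transformers: Tokenizer and HashingTF each map one DataFrame to another.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")

// Estimator: LogisticRegression is fit on a DataFrame and produces a model.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// Pipeline: chains the Transformers and the Estimator into a single workflow.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fitting the Pipeline (itself an Estimator) yields a PipelineModel (a Transformer),
// which turns a DataFrame with features into a DataFrame with predictions.
val model = pipeline.fit(training)
model.transform(training).select("id", "text", "prediction").show()
```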