Timothy Hunter created SPARK-12212:
--------------------------------------

             Summary: Clarify the distinction between spark.mllib and spark.ml
                 Key: SPARK-12212
                 URL: https://issues.apache.org/jira/browse/SPARK-12212
             Project: Spark
          Issue Type: Sub-task
          Components: Documentation
    Affects Versions: 1.5.2
            Reporter: Timothy Hunter


The MLlib documentation is confusing about what exactly MLlib is: is it the 
package, or is it the whole ML effort on Spark, and how does it differ from 
spark.ml? Is MLlib going to be deprecated?

We should do the following:
 - Refer to the code package as spark.mllib across all the documentation. An 
alternative name is "RDD API of MLlib".
 - Refer to the project that encompasses spark.ml + spark.mllib as MLlib 
(this should be the default).
 - Replace references to "Pipeline API" with spark.ml or the "DataFrame API of 
MLlib". I would de-emphasize that this API is for building pipelines. Some users 
are led to believe from the documentation that spark.ml can only be used for 
building pipelines and that using a single algorithm can only be done with 
spark.mllib.

Most relevant places:
 - {{mllib-guide.md}}
 - {{mllib-linear-methods.md}}
 - {{mllib-dimensionality-reduction.md}}
 - {{mllib-pmml-model-export.md}}
 - {{mllib-statistics.md}}
In these files, most references to {{MLlib}} are meant to refer to 
{{spark.mllib}} instead.
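
To help with reviewing those files, here is a minimal sketch of a script that lists the lines mentioning MLlib by name, so each one can be checked by hand. It is only an illustration: the function names, the regular expression, and the assumption that the files sit in one directory (passed as {{docs_dir}}) are not part of this issue or of any Spark tooling. Lines that already say {{spark.mllib}} are not reported.

```python
import re
from pathlib import Path

# Files named in this issue. The directory that holds them is an assumption
# and is passed in by the caller.
DOC_FILES = [
    "mllib-guide.md",
    "mllib-linear-methods.md",
    "mllib-dimensionality-reduction.md",
    "mllib-pmml-model-export.md",
    "mllib-statistics.md",
]

# Matches "MLlib" or "MLLib" as a standalone word. The lowercase package
# name "spark.mllib" does not match because the pattern requires capital "ML".
MLLIB_REF = re.compile(r"(?<![.\w])ML[lL]ib\b")


def find_mllib_references(text):
    """Return (line_number, line) pairs for lines that mention MLlib by name."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(text.splitlines(), start=1)
        if MLLIB_REF.search(line)
    ]


def scan_docs(docs_dir, names=DOC_FILES):
    """Scan the listed files under docs_dir; files that are missing are skipped."""
    hits = {}
    for name in names:
        path = Path(docs_dir) / name
        if path.is_file():
            found = find_mllib_references(path.read_text(encoding="utf-8"))
            if found:
                hits[name] = found
    return hits


def print_report(docs_dir):
    """Print one 'file:line: text' entry per candidate reference."""
    for name, found in scan_docs(docs_dir).items():
        for number, line in found:
            print(f"{name}:{number}: {line}")
```

Calling {{print_report("docs")}} from the root of a checkout would print the candidates, assuming the guides live under {{docs/}}. The script only finds candidates; whether a given mention means the project or the package still has to be decided by a reader.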



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

