[
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027491#comment-15027491
]
Timothy Hunter commented on SPARK-8517:
---------------------------------------
- We need a whole page about best practices for dataframes containing
numerical data (vector UDTs). That was a big pain point for me. We have a
whole page on spark.mllib and we should have something similar for
dataframes.
- In `ml-guide`, I would split the high-level concepts (`fit`, `transform`,
etc.) from chaining them together with a pipeline. From reading the current
document, spark.ml seems harder to use than spark.mllib because it introduces
complicated examples right at the start (model selection with
cross-validation).
- Small nit: the links under each example should point to the GitHub file;
right now they are not very useful. Do we have a ticket for that?
Building examples:
The current way to build a dead-simple dataframe is as follows. It is rather
noisy compared to Python. I would recommend we move all the example code
generation to a library, and thoroughly explain there what the dataframe
contains (or make it part of the graph). For example:
{code}
val data = Array(-0.5, -0.3, 0.0, 0.2)
val dataFrame =
  sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
{code}
This requires some understanding about tuple packing, the synthetic apply
method, etc. Definitely more complicated than the python or RDD equivalent. I
do not have a good solution right now, but I find this a bit unsettling when
this is the first line I read in an example.
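To make the tuple packing concrete, here is a minimal sketch in plain Scala (no Spark needed) of what `data.map(Tuple1.apply)` does on its own. The object name `TuplePackingSketch` and the `explicit` value are only illustrative. As far as I understand it, the wrapping is there because `createDataFrame` expects a collection of `Product` values (tuples or case classes), and a bare `Double` is not one.

```scala
// Illustrative sketch only: plain Scala, no Spark dependency.
object TuplePackingSketch {
  val data: Array[Double] = Array(-0.5, -0.3, 0.0, 0.2)

  // Tuple1.apply is the compiler-generated factory method of the Tuple1
  // case class; mapping it over the array wraps each Double in a 1-tuple.
  val packed: Array[Tuple1[Double]] = data.map(Tuple1.apply)

  // The same thing spelled out with an explicit lambda.
  val explicit: Array[Tuple1[Double]] = data.map(x => Tuple1(x))

  def main(args: Array[String]): Unit = {
    // Each element is a Tuple1, so the value is reached through `_1`.
    packed.foreach(t => println(t._1))
  }
}
```

The explicit lambda is arguably easier to read in a first example, at the cost of a few more characters.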
Other examples are easier to read, I find:
{code}
val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")
{code}
> Improve the organization and style of MLlib's user guide
> --------------------------------------------------------
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, ML, MLlib
> Reporter: Xiangrui Meng
> Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page,
> doesn't have a nice style. We could update it and re-organize the content to
> make it easier to navigate.