[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027491#comment-15027491
 ] 

Timothy Hunter commented on SPARK-8517:
---------------------------------------

- We need a whole page on best practices for dataframes containing numerical 
data (vector UDTs). That was a big pain point for me. We have a whole page on 
spark.mllib and we should have something similar for dataframes.
- In `ml-guide`, I would split the high-level concepts (`fit`, `transform`, 
etc.) from chaining them together with a pipeline. From reading the current 
document, spark.ml seems harder to use than spark.mllib because it introduces 
complicated examples right at the start (model selection with 
cross-validation). 
- Small nit: the links under each example should link to the GitHub file. Right 
now they are not very useful. Do we have a ticket for that?


Building examples:
The current way to build a dead-simple dataframe is as follows. It is rather 
noisy when you compare it to Python. I would recommend we move all the example 
code generation to a library, and thoroughly explain there what the dataframe 
contains (or make it part of the graph). For example:
{code}
val data = Array(-0.5, -0.3, 0.0, 0.2)
val dataFrame =
  sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
{code}
This requires some understanding of tuple packing, the synthetic apply 
method, etc. It is definitely more complicated than the Python or RDD 
equivalent. I do not have a good solution right now, but I find this a bit 
unsettling when this is the first line I read in an example.
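
To make the tuple packing concrete, here is a minimal plain-Scala sketch (no 
Spark needed) of what `data.map(Tuple1.apply)` produces. As far as I 
understand, the wrapping is there because `createDataFrame` expects elements 
that are `Product`s, and a bare `Double` is not one. The object name 
`TuplePackingSketch` and the case class `Feature` are made up for illustration 
and do not exist in Spark.

```scala
object TuplePackingSketch {
  // Same raw values as in the snippet above.
  val data: Array[Double] = Array(-0.5, -0.3, 0.0, 0.2)

  // Tuple1 is a case class, so the compiler generates Tuple1.apply in its
  // companion object. Passing it to map wraps each Double in a 1-tuple.
  val packed: Array[Tuple1[Double]] = data.map(Tuple1.apply)

  // The same thing written out explicitly.
  val packedExplicit: Array[Tuple1[Double]] = data.map(x => Tuple1(x))

  // Hypothetical alternative: a named case class is also a Product, and its
  // field name documents what the single column holds.
  final case class Feature(features: Double)
  val named: Array[Feature] = data.map(Feature.apply)

  def main(args: Array[String]): Unit = {
    println(packed.mkString(", "))
    println(named.mkString(", "))
  }
}
```

With Spark on the classpath, either `packed` or `named` could presumably be 
handed to `sqlContext.createDataFrame`. With the case class, the column name 
comes from the field, so the `toDF("features")` renaming should not be needed.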

Other examples are easier to read, I find:
{code}
val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")
{code}

> Improve the organization and style of MLlib's user guide
> --------------------------------------------------------
>
>                 Key: SPARK-8517
>                 URL: https://issues.apache.org/jira/browse/SPARK-8517
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.


