[ 
https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711298#comment-14711298
 ] 

Kai Sasaki commented on SPARK-10055:
------------------------------------

I submitted an initial version for this competition. The score is not good, 
but I ran into several issues while using the Spark ML API. Some of them may 
simply come from my lack of knowledge of Spark ML, so if any of them can 
already be solved with existing code, please let me know.

* There does not seem to be a {{Transformer}} that can cast column types. In 
this case, {{X}} and {{Y}} are read as String by default by 
[spark-csv|http://spark-packages.org/package/databricks/spark-csv].
  To apply {{StandardScaler}} to {{X}} and {{Y}}, they must be numeric 
types, and I cannot do that conversion with a Spark ML {{Transformer}}. 
Fortunately, {{spark-csv}} can infer the schema types by reading all the data 
once. But when the reading library has no such option, I think it would be 
better to be able to cast column types inside the Spark ML pipeline.
  
* {{StringIndexer}} orders its labels by frequency. But in this competition, 
the output has to be written in alphabetical order, so we have to write some 
extra code to convert the frequency-ordered labels to alphabetical order.
  
* {{StandardScaler}} only accepts vector data as its input. In this case, I 
want to scale {{X}} and {{Y}} with {{StandardScaler}}, but these are plain 
Double values, so they have to be assembled into a feature vector first. 
  Is there any way to apply {{StandardScaler}} directly to plain Int or 
Double columns, or do we always have to assemble them into a feature vector 
before scaling?
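
For the casting and scaling points, a minimal sketch of the workaround as I understand it, assuming the columns are named {{X}} and {{Y}} as in this data set: cast the columns with plain DataFrame operations before the pipeline, then assemble them into a vector so that {{StandardScaler}} can consume them. The object name, the helper names and the output column names ({{xy}}, {{xyScaled}}) are only illustrative, and this is not necessarily what the linked code does.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

object ScaleCoordinates {
  // Cast the given String columns to Double outside the ML pipeline.
  // withColumn with an existing name replaces that column in recent Spark versions.
  def castToDouble(df: DataFrame, names: Seq[String]): DataFrame =
    names.foldLeft(df) { (acc, name) =>
      acc.withColumn(name, col(name).cast(DoubleType))
    }

  // Assemble X and Y into one vector column, then standardize that vector.
  def scalingPipeline(): Pipeline = {
    val assembler = new VectorAssembler()
      .setInputCols(Array("X", "Y"))
      .setOutputCol("xy") // illustrative column name
    val scaler = new StandardScaler()
      .setInputCol("xy")
      .setOutputCol("xyScaled") // illustrative column name
      .setWithMean(true)
      .setWithStd(true)
    new Pipeline().setStages(Array[PipelineStage](assembler, scaler))
  }
}
```

With this, {{ScaleCoordinates.scalingPipeline().fit(ScaleCoordinates.castToDouble(df, Seq("X", "Y")))}} would fit the scaler on the cast columns, where {{df}} stands for the DataFrame read from the CSV file. The cast still lives outside the pipeline, which is exactly the gap described in the first point.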
  
The code is 
[here|https://github.com/Lewuathe/kaggle-jobs/blob/master/src/main/scala/com/lewuathe/SfCrimeClassification.scala].
 Thank you.
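
For the {{StringIndexer}} point above, the extra code can be as small as the sketch below. It assumes the label array is available in index order (position i holds the label for index i) and that there is one value per label, for example a predicted probability. The object name, the function name and the category strings in the example are illustrative only.

```scala
object AlphabeticalLabels {
  // Pair each label with its value, then sort the pairs by label name.
  // indexedLabels is in indexer (frequency) order; the result is alphabetical.
  def reorder(indexedLabels: Array[String],
              values: Array[Double]): Array[(String, Double)] = {
    require(indexedLabels.length == values.length, "one value per label expected")
    indexedLabels.zip(values).sortBy(_._1)
  }
}
```

For example, {{AlphabeticalLabels.reorder(Array("LARCENY/THEFT", "ASSAULT", "ARSON"), Array(0.7, 0.2, 0.1))}} yields the pairs in the order ARSON, ASSAULT, LARCENY/THEFT, each still attached to its own value.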


> San Francisco Crime Classification
> ----------------------------------
>
>                 Key: SPARK-10055
>                 URL: https://issues.apache.org/jira/browse/SPARK-10055
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Xusen Yin
>
> Apply ML pipeline API to San Francisco Crime Classification 
> (https://www.kaggle.com/c/sf-crime).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
