[jira] [Commented] (SPARK-4872) Provide sample format of training/test data in MLlib programming guide

zhang jun wei (JIRA) Fri, 19 Dec 2014 17:45:23 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254441#comment-14254441
 ]


zhang jun wei commented on SPARK-4872:
--------------------------------------

Yes, it does clarify the usage, thanks very much for your patience.

> Provide sample format of training/test data in MLlib programming guide
> ----------------------------------------------------------------------
>
>                 Key: SPARK-4872
>                 URL: https://issues.apache.org/jira/browse/SPARK-4872
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.1.1
>            Reporter: zhang jun wei
>              Labels: documentation
>
> I suggest: in samples of the online programming guide of MLlib, it's better 
> to give examples in the real life data, and list the translated data format 
> for the model to consume. 
> The problem blocking me is how to translate the real life data into the 
> format which MLLib  can understand correctly. 
> Here is one sample, I want to use NaiveBayes to train and predict tennis-play 
> decision, the original data is:
> Weather | Temperature | Humidity | Wind  => Decision to play tennis
> Sunny     | Hot               | High       | No     => No
> Sunny     | Hot               | High       | Yes    => No
> Cloudy    | Normal         | Normal   | No     => Yes
> Rainy      | Cold             | Normal   | Yes    => No
> Now, from my understanding, one potential translation is:
> 1) put every feature value word into a line:
> Sunny Cloudy Rainy Hot Normal Cold High Normal Yes No
> 2) map them to numbers:
> 1 2 3 4 5 6 7 8 9 10
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) set the value to 1 if it appears, or 0 if not, for the above example, here 
> is the data format for MLUtils.loadLibSVMFile to use:
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:0 10:1
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:1 10:0
> 1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1
> 0 1:0 2:0 3:1 4:0 5:0 6:1 7:0 8:1 9:1 10:0
> ==> Is this a correct understanding?
> And another way I can image is:
> 1) put every feature name into a line:
> Weather  Temperature  Humidity  Wind
> 2) map them to numbers:
> 1 2 3 4 
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) map each value of each feature to a number (e.g. Sunny to 1, Cloudy to 2, 
> Rainy to 3; Hot to 1, Normal to 2, Cold to 3; High to 1, Normal to 2; Yes to 
> 1, No to 2) for the above example, here is the data format for 
> MLUtils.loadLibSVMFile to use:
> 0 1:1 2:1 3:1 4:2
> 0 1:1 2:1 3:1 4:1
> 1 1:2 2:2 3:2 4:2
> 0 1:3 2:3 3:2 4:1
> ==> but when I read the source code in NaiveBayes.scala, seems this is not 
> correct, I am not sure though...
> So which data format translation way is correct?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-4872) Provide sample format of training/test data in MLlib programming guide

Reply via email to