zhang jun wei created SPARK-4872:
------------------------------------

             Summary: Provide sample format of training/test data in MLlib 
programming guide
                 Key: SPARK-4872
                 URL: https://issues.apache.org/jira/browse/SPARK-4872
             Project: Spark
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 1.1.1
            Reporter: zhang jun wei


I suggest that the samples in the online MLlib programming guide use real-life 
data and show that data translated into the format the model consumes. 

What is blocking me is how to translate real-life data into a format that 
MLlib can understand correctly. 

Here is one example. I want to use NaiveBayes to train on and predict the 
decision to play tennis. The original data is:
Weather | Temperature | Humidity | Wind => Decision to play tennis
Sunny   | Hot         | High     | No   => No
Sunny   | Hot         | High     | Yes  => No
Cloudy  | Normal      | Normal   | No   => Yes
Rainy   | Cold        | Normal   | Yes  => No

Now, from my understanding, one potential translation is:
1) put every feature value word into a line:
Sunny Cloudy Rainy Hot Normal Cold High Normal Yes No
2) map them to numbers:
1 2 3 4 5 6 7 8 9 10
3) map decision labels to numbers:
0 - No
1 - Yes
4) set the value to 1 if it appears in the row, or 0 if not. For the above 
example, here is the data in the format that MLUtils.loadLibSVMFile reads:
0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:0 10:1
0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:1 10:0
1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1
0 1:0 2:0 3:1 4:0 5:0 6:1 7:0 8:1 9:1 10:0
==> Is this a correct understanding?
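
As a minimal sketch of this first translation, the lines above could be 
generated as follows. This is plain Python with no Spark dependency, and every 
name in it (FEATURE_VALUES, build_index, one_hot_line) is hypothetical rather 
than part of MLlib. Values are keyed by (column, value) so that the two 
"Normal" values (Temperature and Humidity) get distinct indices.

```python
# Hypothetical sketch: one-hot encode the tennis rows into LibSVM-style lines.
TENNIS_ROWS = [
    (("Sunny", "Hot", "High", "No"), "No"),
    (("Sunny", "Hot", "High", "Yes"), "No"),
    (("Cloudy", "Normal", "Normal", "No"), "Yes"),
    (("Rainy", "Cold", "Normal", "Yes"), "No"),
]

# Possible values per feature column, in the order used in step 1.
FEATURE_VALUES = [
    ["Sunny", "Cloudy", "Rainy"],  # Weather
    ["Hot", "Normal", "Cold"],     # Temperature
    ["High", "Normal"],            # Humidity
    ["Yes", "No"],                 # Wind
]
LABEL_CODES = {"No": 0, "Yes": 1}


def build_index(feature_values):
    """Assign a one-based index to every (column, value) pair."""
    index = {}
    next_id = 1
    for col, values in enumerate(feature_values):
        for value in values:
            index[(col, value)] = next_id
            next_id += 1
    return index


def one_hot_line(features, label, index):
    """Format one row as a line with an explicit 0 or 1 for every index."""
    active = {index[(col, value)] for col, value in enumerate(features)}
    pairs = ["%d:%d" % (i, 1 if i in active else 0)
             for i in range(1, len(index) + 1)]
    return "%d %s" % (LABEL_CODES[label], " ".join(pairs))


value_index = build_index(FEATURE_VALUES)
one_hot_lines = [one_hot_line(features, label, value_index)
                 for features, label in TENNIS_ROWS]
for line in one_hot_lines:
    print(line)
```

The printed lines could then be saved to a text file and passed to 
MLUtils.loadLibSVMFile.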

Another way I can imagine is:
1) put every feature name into a line:
Weather  Temperature  Humidity  Wind
2) map them to numbers:
1 2 3 4 
3) map decision labels to numbers:
0 - No
1 - Yes
4) map each value of each feature to a number (e.g. Sunny to 1, Cloudy to 2, 
Rainy to 3; Hot to 1, Normal to 2, Cold to 3; High to 1, Normal to 2; Yes to 1, 
No to 2). For the above example, here is the data in the format that 
MLUtils.loadLibSVMFile reads:
0 1:1 2:1 3:1 4:2
0 1:1 2:1 3:1 4:1
1 1:2 2:2 3:2 4:2
0 1:3 2:3 3:2 4:1
==> But when I read the source code in NaiveBayes.scala, this does not seem to 
be correct, though I am not sure.
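
For comparison, here is a minimal sketch of this second translation, again in 
plain Python with hypothetical names (VALUE_CODES, ordinal_line) that are not 
part of MLlib. Each feature column keeps a single index, and its value is 
replaced by the number assigned in step 4. Note that the codes are arbitrary: 
nothing about the data says Rainy (3) is "three times" Sunny (1).

```python
# Hypothetical sketch: encode each feature as one index with a numeric code.
WEATHER_ROWS = [
    (("Sunny", "Hot", "High", "No"), "No"),
    (("Sunny", "Hot", "High", "Yes"), "No"),
    (("Cloudy", "Normal", "Normal", "No"), "Yes"),
    (("Rainy", "Cold", "Normal", "Yes"), "No"),
]

# One value-to-number mapping per feature column, as listed in step 4.
VALUE_CODES = [
    {"Sunny": 1, "Cloudy": 2, "Rainy": 3},  # Weather     -> index 1
    {"Hot": 1, "Normal": 2, "Cold": 3},     # Temperature -> index 2
    {"High": 1, "Normal": 2},               # Humidity    -> index 3
    {"Yes": 1, "No": 2},                    # Wind        -> index 4
]
DECISION_CODES = {"No": 0, "Yes": 1}


def ordinal_line(features, label):
    """Format one row as 'label index:code ...' with one-based indices."""
    pairs = ["%d:%d" % (col + 1, VALUE_CODES[col][value])
             for col, value in enumerate(features)]
    return "%d %s" % (DECISION_CODES[label], " ".join(pairs))


ordinal_lines = [ordinal_line(features, label)
                 for features, label in WEATHER_ROWS]
for line in ordinal_lines:
    print(line)
```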

So which of these two data format translations is correct?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
