zhang jun wei created SPARK-4872:
------------------------------------
Summary: Provide sample format of training/test data in MLlib
programming guide
Key: SPARK-4872
URL: https://issues.apache.org/jira/browse/SPARK-4872
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.1.1
Reporter: zhang jun wei
I suggest that the samples in the online MLlib programming guide use real-life
data and show the translated data format that the model consumes.
The problem blocking me is how to translate real-life data into a format
that MLlib can understand correctly.
Here is one sample. I want to use NaiveBayes to train on and predict the
decision to play tennis. The original data is:
Weather | Temperature | Humidity | Wind => Decision to play tennis
Sunny | Hot | High | No => No
Sunny | Hot | High | Yes => No
Cloudy | Normal | Normal | No => Yes
Rainy | Cold | Normal | Yes => No
Now, from my understanding, one potential translation is:
1) put every feature value word into a line:
Sunny Cloudy Rainy Hot Normal Cold High Normal Yes No
2) map them to numbers:
1 2 3 4 5 6 7 8 9 10
3) map decision labels to numbers:
0 - No
1 - Yes
4) set the value to 1 if the word appears in the row, or 0 if not. For the
above example, here is the data format for MLUtils.loadLibSVMFile to use:
0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:0 10:1
0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:1 10:0
1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1
0 1:0 2:0 3:1 4:0 5:0 6:1 7:0 8:1 9:1 10:0
==> Is this a correct understanding?
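To make steps 1) to 4) concrete, here is a minimal sketch in plain Python (not Spark code) that produces the lines above. The names ROWS, VOCAB, LABELS and one_hot_line are hypothetical and used only for illustration; the data, the word order and the label mapping come from the example above.

```python
# Sketch of the "one indicator per feature value" translation.
# All names here are hypothetical; only the data comes from the example.

ROWS = [
    (("Sunny", "Hot", "High", "No"), "No"),
    (("Sunny", "Hot", "High", "Yes"), "No"),
    (("Cloudy", "Normal", "Normal", "No"), "Yes"),
    (("Rainy", "Cold", "Normal", "Yes"), "No"),
]

# Value words per feature, in the order used in step 1.
VOCAB = [
    ["Sunny", "Cloudy", "Rainy"],  # Weather     -> indices 1-3
    ["Hot", "Normal", "Cold"],     # Temperature -> indices 4-6
    ["High", "Normal"],            # Humidity    -> indices 7-8
    ["Yes", "No"],                 # Wind        -> indices 9-10
]

LABELS = {"No": 0, "Yes": 1}


def one_hot_line(values, decision):
    """Build one 'label index:value ...' line with 1-based indices."""
    parts = [str(LABELS[decision])]
    index = 1
    for value, vocab in zip(values, VOCAB):
        for word in vocab:
            # 1 if this word is the row's value for the feature, else 0.
            parts.append(f"{index}:{1 if word == value else 0}")
            index += 1
    return " ".join(parts)


one_hot_lines = [one_hot_line(values, decision) for values, decision in ROWS]
print("\n".join(one_hot_lines))
```

Keeping a separate word list per feature means that "Normal" for Temperature (index 5) and "Normal" for Humidity (index 8) do not collide. As far as I know, LIBSVM is a sparse format, so the zero-valued entries could also be left out of each line.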
Another way I can imagine is:
1) put every feature name into a line:
Weather Temperature Humidity Wind
2) map them to numbers:
1 2 3 4
3) map decision labels to numbers:
0 - No
1 - Yes
4) map each value of each feature to a number (e.g. Sunny to 1, Cloudy to 2,
Rainy to 3; Hot to 1, Normal to 2, Cold to 3; High to 1, Normal to 2; Yes to 1,
No to 2). For the above example, here is the data format for
MLUtils.loadLibSVMFile to use:
0 1:1 2:1 3:1 4:2
0 1:1 2:1 3:1 4:1
1 1:2 2:2 3:2 4:2
0 1:3 2:3 3:2 4:1
==> But when I read the source code in NaiveBayes.scala, this does not seem to
be correct. I am not sure, though.
So which of these two data format translations is correct?
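For comparison, this second translation can be sketched the same way in plain Python (again not Spark code). The names ROWS, CODES, LABELS and coded_line are hypothetical; the per-feature number assignments are the ones listed in step 4.

```python
# Sketch of the "one column per feature, value mapped to a number" translation.
# All names here are hypothetical; only the data comes from the example.

ROWS = [
    (("Sunny", "Hot", "High", "No"), "No"),
    (("Sunny", "Hot", "High", "Yes"), "No"),
    (("Cloudy", "Normal", "Normal", "No"), "Yes"),
    (("Rainy", "Cold", "Normal", "Yes"), "No"),
]

# Number assigned to each value, one table per feature (columns 1-4).
CODES = [
    {"Sunny": 1, "Cloudy": 2, "Rainy": 3},  # 1: Weather
    {"Hot": 1, "Normal": 2, "Cold": 3},     # 2: Temperature
    {"High": 1, "Normal": 2},               # 3: Humidity
    {"Yes": 1, "No": 2},                    # 4: Wind
]

LABELS = {"No": 0, "Yes": 1}


def coded_line(values, decision):
    """Build one 'label index:code ...' line with 1-based feature indices."""
    parts = [str(LABELS[decision])]
    for index, (value, codes) in enumerate(zip(values, CODES), start=1):
        parts.append(f"{index}:{codes[value]}")
    return " ".join(parts)


coded_lines = [coded_line(values, decision) for values, decision in ROWS]
print("\n".join(coded_lines))
```

Note that in this form the numbers carry an order and a magnitude (Rainy = 3 is "three times" Sunny = 1) that the original categories do not have, which may be part of why it looks wrong against NaiveBayes.scala.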