[ https://issues.apache.org/jira/browse/SPARK-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251033#comment-14251033 ]
zhang jun wei commented on SPARK-4872:
--------------------------------------
Hi Sean, thanks for the response.
I changed the input format: I now use -1 and 1 for the binary labels, and I
encode each feature value as an integer from 0 to n-1.
The training data is now:
-1 1:0 2:0 3:0 4:1
-1 1:0 2:0 3:0 4:0
1 1:1 2:0 3:0 4:1
1 1:2 2:1 3:0 4:1
1 1:2 2:2 3:1 4:1
-1 1:2 2:2 3:1 4:0
1 1:1 2:2 3:1 4:0
-1 1:0 2:1 3:0 4:1
1 1:0 2:2 3:1 4:1
1 1:2 2:1 3:1 4:1
1 1:0 2:1 3:1 4:0
1 1:1 2:1 3:0 4:0
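For reference, here is roughly how I load and train on it (just a sketch; the
SparkContext setup and the file name are placeholders, and I am using the
default smoothing parameter):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.util.MLUtils

// Placeholder app name and file name; the file holds the 12 libsvm lines above.
val sc = new SparkContext(new SparkConf().setAppName("TennisNaiveBayes"))
val training = MLUtils.loadLibSVMFile(sc, "tennis_train.txt")

// 1.0 is the default additive-smoothing parameter (lambda).
val model = NaiveBayes.train(training, 1.0)

println("Labels: " + model.labels.mkString(", "))
println("Pi: " + model.pi.mkString(", "))
println("theta matrix:")
model.theta.foreach(row => println(row.mkString(" ")))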
What confuses me, though, are the members of the resulting NaiveBayesModel;
here is what I got:
Labels: -1.0, 1.0
Pi: -1.0296194171811581, -0.44183275227903884
theta matrix:
-1.3862943611198906 -1.0986122886681098 -1.791759469228055 -1.3862943611198906
-1.1939224684724343 -1.0986122886681096 -1.7047480922384253 -1.7047480922384253
My confusion is: why is the theta matrix a 2x4 matrix? From reading the code,
I think:
1) Pi is the prior probability of each label (also log-transformed), right?
2) the theta matrix should be the matrix of P(FeatureValue_i | label) /
P(FeatureValue_i) values (like P(Weather=Cloudy | Decision=Yes) /
P(Weather=Cloudy), also log-transformed) for each feature value under each
label.
2.1) so when I predict on a test record: (Cloudy, Normal, Normal, No),
2.2) it will compare P(Decision=Yes | Cloudy, Normal, Normal, No) with
P(Decision=No | Cloudy, Normal, Normal, No) to determine the decision, so the
theta matrix should provide the value for each factor, like
P(Weather=Cloudy | Decision=Yes)/P(Weather=Cloudy).
2.3) in this sample the Weather feature has 3 values, Temperature has 3,
Humidity has 2, and Wind has 2, so the theta matrix should have 10 columns;
but it only has 4 columns, one per feature, so I am confused about how that
works (see my rough reading of the code after this list).
2.4) that is why I described my first data translation above (translating to
10 binary features myself); in that encoding the theta matrix is 2x10.
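Here is my rough reading of how the aggregation in NaiveBayes.scala produces
Pi and theta (only a sketch of my understanding, assuming the default additive
smoothing lambda = 1.0, not the actual MLlib code): both seem to come from
per-label document counts and per-label sums of the raw feature values, which
would give one theta column per feature rather than per feature value.

// A sketch of my understanding, not the actual MLlib code.
// Each pair is (label, feature values) from the training data above.
val rows: Seq[(Double, Array[Double])] = Seq(
  (-1.0, Array(0.0, 0.0, 0.0, 1.0)), (-1.0, Array(0.0, 0.0, 0.0, 0.0)),
  (1.0, Array(1.0, 0.0, 0.0, 1.0)), (1.0, Array(2.0, 1.0, 0.0, 1.0)),
  (1.0, Array(2.0, 2.0, 1.0, 1.0)), (-1.0, Array(2.0, 2.0, 1.0, 0.0)),
  (1.0, Array(1.0, 2.0, 1.0, 0.0)), (-1.0, Array(0.0, 1.0, 0.0, 1.0)),
  (1.0, Array(0.0, 2.0, 1.0, 1.0)), (1.0, Array(2.0, 1.0, 1.0, 1.0)),
  (1.0, Array(0.0, 1.0, 1.0, 0.0)), (1.0, Array(1.0, 1.0, 0.0, 0.0)))

val lambda = 1.0
val numFeatures = 4
val numDocs = rows.size
val byLabel = rows.groupBy(_._1)
val numLabels = byLabel.size

byLabel.toSeq.sortBy(_._1).foreach { case (label, docs) =>
  // log prior: log((count(label) + lambda) / (numDocs + numLabels * lambda))
  val pi = math.log((docs.size + lambda) / (numDocs + numLabels * lambda))
  // per-feature sums of the raw feature values seen under this label
  val sums = (0 until numFeatures).map(j => docs.map(_._2(j)).sum)
  val total = sums.sum
  // log conditional: log((sums(j) + lambda) / (total + numFeatures * lambda))
  val theta = sums.map(s => math.log((s + lambda) / (total + numFeatures * lambda)))
  println(s"label $label  pi = $pi  theta = ${theta.mkString(" ")}")
}

With the 12 rows above this reproduces the Pi values and the 2x4 theta rows I
printed, so I believe the model is treating each feature value as a count
rather than as a category index.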
Hopefully I have described my confusion about NaiveBayesModel clearly.
Btw, I also tested the new training data with DecisionTree.trainClassifier
(still using MLUtils.loadLibSVMFile to load the data); it seems to require the
labels to be 0 and 1, and it reports an error for the label -1.
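The workaround I used for that test was to remap the labels to 0/1 before
calling the tree trainer; a sketch, reusing the training RDD from above (the
tree parameters here are just values I picked for this tiny data set, not
recommendations):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Remap the -1/1 labels into the 0/1 range DecisionTree.trainClassifier expects.
val remapped = training.map(lp =>
  LabeledPoint(if (lp.label < 0) 0.0 else 1.0, lp.features))

// trainClassifier(input, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
val treeModel = DecisionTree.trainClassifier(
  remapped, 2, Map[Int, Int](), "gini", 3, 8)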
> Provide sample format of training/test data in MLlib programming guide
> ----------------------------------------------------------------------
>
> Key: SPARK-4872
> URL: https://issues.apache.org/jira/browse/SPARK-4872
> Project: Spark
> Issue Type: Improvement
> Components: Documentation
> Affects Versions: 1.1.1
> Reporter: zhang jun wei
> Labels: documentation
>
> I suggest that, in the examples of the online MLlib programming guide, it is
> better to use real-life data and to list the translated data format for the
> model to consume.
> The problem blocking me is how to translate real-life data into a format
> which MLlib can understand correctly.
> Here is one sample: I want to use NaiveBayes to train and predict a
> tennis-play decision; the original data is:
> Weather | Temperature | Humidity | Wind => Decision to play tennis
> Sunny | Hot | High | No => No
> Sunny | Hot | High | Yes => No
> Cloudy | Normal | Normal | No => Yes
> Rainy | Cold | Normal | Yes => No
> Now, from my understanding, one potential translation is:
> 1) put every distinct feature value on one line:
> Sunny Cloudy Rainy Hot Normal Cold High Normal Yes No
> 2) map them to numbers:
> 1 2 3 4 5 6 7 8 9 10
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) set each value to 1 if it appears and 0 if not; for the above example,
> here is the data format for MLUtils.loadLibSVMFile to use:
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:0 10:1
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:1 10:0
> 1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1
> 0 1:0 2:0 3:1 4:0 5:0 6:1 7:0 8:1 9:1 10:0
> ==> Is this a correct understanding?
> And another way I can imagine is:
> 1) put every feature name into a line:
> Weather Temperature Humidity Wind
> 2) map them to numbers:
> 1 2 3 4
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) map each value of each feature to a number (e.g. Sunny to 1, Cloudy to 2,
> Rainy to 3; Hot to 1, Normal to 2, Cold to 3; High to 1, Normal to 2; Yes to
> 1, No to 2); for the above example, here is the data format for
> MLUtils.loadLibSVMFile to use:
> 0 1:1 2:1 3:1 4:2
> 0 1:1 2:1 3:1 4:1
> 1 1:2 2:2 3:2 4:2
> 0 1:3 2:3 3:2 4:1
> ==> but when I read the source code in NaiveBayes.scala, it seems this is
> not correct; I am not sure though...
> So which data format translation way is correct?