[ https://issues.apache.org/jira/browse/SPARK-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251033#comment-14251033 ]

zhang jun wei commented on SPARK-4872:
--------------------------------------

Hi Sean, thanks for the response.

I changed the input format: I now use -1 and 1 as the binary labels, and I 
encode each feature value as an integer from 0 to n-1.
Now the training data is (a loading sketch follows the data):
-1 1:0 2:0 3:0 4:1
-1 1:0 2:0 3:0 4:0
1 1:1 2:0 3:0 4:1
1 1:2 2:1 3:0 4:1
1 1:2 2:2 3:1 4:1
-1 1:2 2:2 3:1 4:0
1 1:1 2:2 3:1 4:0
-1 1:0 2:1 3:0 4:1
1 1:0 2:2 3:1 4:1
1 1:2 2:1 3:1 4:1
1 1:0 2:1 3:1 4:0
1 1:1 2:1 3:0 4:0
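
For reference, here is roughly how I load this file and train the model (a 
minimal sketch; the application name, master, and file path are placeholders 
for my local setup):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.util.MLUtils

// Placeholder context and path -- adjust to your environment.
val sc = new SparkContext(
  new SparkConf().setAppName("TennisNB").setMaster("local[*]"))
val data = MLUtils.loadLibSVMFile(sc, "data/tennis.libsvm")

// Train Naive Bayes; lambda is the additive smoothing parameter.
val model = NaiveBayes.train(data, lambda = 1.0)

println("Labels: " + model.labels.mkString(", "))
println("Pi: " + model.pi.mkString(", "))
println("Theta:\n" + model.theta.map(_.mkString(" ")).mkString("\n"))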

But what confuses me are the members of NaiveBayesModel; here is what I got:
Labels: -1.0, 1.0
Pi: -1.0296194171811581, -0.44183275227903884
theta matrix: 
-1.3862943611198906 -1.0986122886681098 -1.791759469228055 -1.3862943611198906
-1.1939224684724343 -1.0986122886681096 -1.7047480922384253 -1.7047480922384253

My confusion is: why is the theta matrix a 2x4 matrix? From reading the code, I 
think:
1) Pi is the probability of each label (stored as a log value), right?
2) The theta matrix should be the matrix of P(FeatureValue_i | label) / 
P(FeatureValue_i) values (for example P(Weather=Cloudy | Decision=Yes) / 
P(Weather=Cloudy), also stored as log values), one entry per feature value for 
each label.
  2.1) So when I predict on a test record:
  (Cloudy, Normal, Normal, No)
  2.2) it will compare P(Decision=Yes | Cloudy, Normal, Normal, No) and 
P(Decision=No | Cloudy, Normal, Normal, No) to determine the decision, so the 
theta matrix should provide the value for each factor, like P(Weather=Cloudy | 
Decision=Yes) / P(Weather=Cloudy).
  2.3) In this sample, the Weather feature has 3 values, Temperature has 3, 
Humidity has 2, and Wind has 2, so the theta matrix should have 10 columns; but 
it only has 4 columns, mapping to the 4 features, so I am confused about how it 
works (see the sketch after this list).
  2.4) That is why I described my first data translation model above 
(translating to 10 features myself); in that way, the theta matrix is 2x10.
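
From my reading of NaiveBayes.scala, the prediction seems to combine pi and 
theta per feature index rather than per feature value, roughly like the sketch 
below (my own simplification of the 1.1 code, not the actual implementation):

import org.apache.spark.mllib.linalg.Vector

// My simplified reading of NaiveBayesModel.predict: a multinomial-style model,
// where theta has one column per feature index, not per feature value.
def predictSketch(labels: Array[Double],
                  pi: Array[Double],            // log probability of each label
                  theta: Array[Array[Double]],  // log probabilities per feature
                  x: Vector): Double = {
  // score(label) = pi(label) + sum over features j of theta(label)(j) * x(j)
  val scores = labels.indices.map { i =>
    pi(i) + theta(i).zip(x.toArray).map { case (t, v) => t * v }.sum
  }
  labels(scores.indexOf(scores.max))
}

If that reading is right, it would explain the 2x4 shape: theta has one column 
per feature, and the feature value itself is multiplied in as a weight, rather 
than each categorical value getting its own column.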

Hopefully I have described my confusion about NaiveBayesModel clearly.

Btw, I tested the new training data with DecisionTree.trainClassifier (still 
using MLUtils.loadLibSVMFile to load the data); it seems to require the labels 
to be 0 and 1, otherwise it reports an error for the label -1.
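
For completeness, here is the remapping I used to make 
DecisionTree.trainClassifier accept the same data (reusing the data RDD from 
the sketch above; the parameter values are just what I tried, not 
recommendations):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// DecisionTree.trainClassifier expects labels in {0, ..., numClasses - 1},
// so remap -1 -> 0 before training.
val remapped = data.map(p =>
  LabeledPoint(if (p.label < 0) 0.0 else 1.0, p.features))

val dtModel = DecisionTree.trainClassifier(
  remapped,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),  // treating features as continuous
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)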

> Provide sample format of training/test data in MLlib programming guide
> ----------------------------------------------------------------------
>
>                 Key: SPARK-4872
>                 URL: https://issues.apache.org/jira/browse/SPARK-4872
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.1.1
>            Reporter: zhang jun wei
>              Labels: documentation
>
> I suggest that the examples in the online MLlib programming guide use 
> real-life data and show the translated data format for the model to consume.
> The problem blocking me is how to translate real-life data into a format 
> which MLlib can understand correctly.
> Here is one sample: I want to use NaiveBayes to train and predict the 
> decision to play tennis. The original data is:
> Weather | Temperature | Humidity | Wind => Decision to play tennis
> Sunny   | Hot         | High     | No   => No
> Sunny   | Hot         | High     | Yes  => No
> Cloudy  | Normal      | Normal   | No   => Yes
> Rainy   | Cold        | Normal   | Yes  => No
> Now, from my understanding, one potential translation is:
> 1) put every feature value into one line:
> Sunny Cloudy Rainy Hot Normal Cold High Normal Yes No
> 2) map them to numbers:
> 1 2 3 4 5 6 7 8 9 10
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) set the value to 1 if it appears, or 0 if not. For the above example, here 
> is the data format for MLUtils.loadLibSVMFile to use:
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:0 10:1
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:1 10:0
> 1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1
> 0 1:0 2:0 3:1 4:0 5:0 6:1 7:0 8:1 9:1 10:0
> ==> Is this a correct understanding?
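> (As a sketch, I expect MLUtils.loadLibSVMFile to turn the first line above 
> into something equivalent to the following -- the names here are only for 
> illustration:)
>
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.regression.LabeledPoint
>
> // First line: label 0 (No); the indices set to 1 are Sunny, Hot, High, and
> // Wind=No. The file uses 1-based indices; the resulting vector is 0-based.
> val firstPoint = LabeledPoint(0.0,
>   Vectors.sparse(10, Seq((0, 1.0), (3, 1.0), (6, 1.0), (9, 1.0))))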
> And another way I can imagine is:
> 1) put every feature name into a line:
> Weather  Temperature  Humidity  Wind
> 2) map them to numbers:
> 1 2 3 4 
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) map each value of each feature to a number (e.g. Sunny to 1, Cloudy to 2, 
> Rainy to 3; Hot to 1, Normal to 2, Cold to 3; High to 1, Normal to 2; Yes to 
> 1, No to 2). For the above example, here is the data format for 
> MLUtils.loadLibSVMFile to use:
> 0 1:1 2:1 3:1 4:2
> 0 1:1 2:1 3:1 4:1
> 1 1:2 2:2 3:2 4:2
> 0 1:3 2:3 3:2 4:1
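> (Again just as an illustration, using the same imports as above, I expect the 
> first line of this format to load as:)
>
> val firstPoint2 = LabeledPoint(0.0,
>   Vectors.sparse(4, Seq((0, 1.0), (1, 1.0), (2, 1.0), (3, 2.0))))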
> ==> But when I read the source code in NaiveBayes.scala, it seems this is 
> not correct; I am not sure though...
> So which data format translation way is correct?


