[jira] [Commented] (SPARK-4872) Provide sample format of training/test data in MLlib programming guide

Sean Owen (JIRA) Thu, 18 Dec 2014 02:01:58 -0800

    [ https://issues.apache.org/jira/browse/SPARK-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251441#comment-14251441 ]


Sean Owen commented on SPARK-4872:
----------------------------------

The format looks more like libsvm. You do need binary features to use Naive 
Bayes, but your data set does not have binary features. You show 4 binary 
features here, which is fine, but that's not your data set. I would expect 
that you should one-hot encode your features, right?
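
To make the one-hot suggestion concrete, here is a minimal plain-Python sketch 
(no Spark involved). The four rows and the category order are copied from the 
issue description; the names CATEGORIES, LABELS, ROWS, one_hot and encode are 
hypothetical helpers made up for this illustration, not part of MLlib.

```python
# Categories per feature, in the order listed in the issue description.
CATEGORIES = [
    ("Weather", ["Sunny", "Cloudy", "Rainy"]),
    ("Temperature", ["Hot", "Normal", "Cold"]),
    ("Humidity", ["High", "Normal"]),
    ("Wind", ["Yes", "No"]),
]

# Decision labels mapped to class indices, as in the issue description.
LABELS = {"No": 0.0, "Yes": 1.0}

# The four example rows: (feature values, decision).
ROWS = [
    (("Sunny", "Hot", "High", "No"), "No"),
    (("Sunny", "Hot", "High", "Yes"), "No"),
    (("Cloudy", "Normal", "Normal", "No"), "Yes"),
    (("Rainy", "Cold", "Normal", "Yes"), "No"),
]


def one_hot(values):
    """Return a dense 0/1 vector with one slot per (feature, category) pair."""
    vector = []
    for (name, categories), value in zip(CATEGORIES, values):
        if value not in categories:
            raise ValueError("unknown value %r for feature %s" % (value, name))
        vector.extend(1.0 if value == category else 0.0 for category in categories)
    return vector


def encode(rows):
    """Turn raw rows into (label, one-hot vector) pairs."""
    return [(LABELS[decision], one_hot(values)) for values, decision in rows]


encoded = encode(ROWS)
for label, vector in encoded:
    print(int(label), [int(v) for v in vector])
```

Each categorical feature expands into one column per category, so the 4 
original features become 10 binary columns, and every row has exactly four 
columns set to 1.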

You have 2 classes to predict and 4 features, so theta is 2x4.

See the scaladoc. pi is the log of the class priors, and theta is indeed the 
log of the conditional class probabilities, but it's just 
P(FeatureValue_i | label), since P(FeatureValue_i) is constant for the purposes 
of picking the most likely class. What is computed is 
log(P(label | FeatureValue_i)), up to that constant, using a dot product. You 
might want to look up how (multinomial) Naive Bayes works first.
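
As a rough sketch of how pi, theta and the dot product fit together, here is a 
small multinomial Naive Bayes in plain Python. It is not the MLlib code: the 
function names are hypothetical, and the additive (Laplace) smoothing formula 
is the textbook one, which I am assuming is close to what MLlib does. The 
training rows are a one-hot encoding of the four tennis examples (10 binary 
columns), so theta comes out as 2x10 here.

```python
import math


def train_multinomial_nb(data, num_classes, smoothing=1.0):
    """Fit log priors (pi) and log conditional probabilities (theta).

    data is a list of (label, features) pairs, with labels 0..num_classes-1
    and non-negative feature values. The smoothing term is an assumption of
    this sketch (standard additive smoothing).
    """
    num_features = len(data[0][1])
    class_counts = [0] * num_classes
    feature_sums = [[0.0] * num_features for _ in range(num_classes)]
    for label, features in data:
        c = int(label)
        class_counts[c] += 1
        for j, value in enumerate(features):
            feature_sums[c][j] += value

    log_total = math.log(len(data) + num_classes * smoothing)
    pi = [math.log(class_counts[c] + smoothing) - log_total
          for c in range(num_classes)]

    theta = []
    for c in range(num_classes):
        log_denom = math.log(sum(feature_sums[c]) + num_features * smoothing)
        theta.append([math.log(feature_sums[c][j] + smoothing) - log_denom
                      for j in range(num_features)])
    return pi, theta


def predict(pi, theta, features):
    """Pick the class maximizing pi[c] + dot(theta[c], features)."""
    scores = [pi[c] + sum(t * x for t, x in zip(theta[c], features))
              for c in range(len(pi))]
    return max(range(len(scores)), key=scores.__getitem__)


# One-hot rows for the four tennis examples; columns are
# Sunny Cloudy Rainy | Hot Normal Cold | High Normal | Yes No.
training = [
    (0, [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]),
    (0, [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]),
    (1, [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]),
    (0, [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]),
]

pi, theta = train_multinomial_nb(training, num_classes=2)
predictions = [predict(pi, theta, features) for _, features in training]
print(predictions)
```

The score for each class is only the log posterior up to a shared constant, 
which is why the evidence term can be dropped when taking the argmax.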

Although you describe a data set with 10 features, your example shows 4. I 
think the confusion is that you have not encoded your data in a way that makes 
sense for the Naive Bayes classifier.

Yes, you need 0/1 labels in general. You do not need to use libsvm format at 
all, and in fact I would not use it unless you are trying to consume an 
existing data set. Otherwise you have to do extra conversions.
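
To illustrate the extra conversion, here is a small plain-Python sketch that 
parses one libsvm-style line into a (label, dense vector) pair and compares it 
with the same pair built directly. parse_libsvm_line is a hypothetical helper 
written for this example, not MLUtils.loadLibSVMFile; the one thing it shares 
with the real format is that indices in the file are 1-based.

```python
def parse_libsvm_line(line, num_features):
    """Parse 'label index:value index:value ...' into (label, dense vector).

    Indices in the text are 1-based, so each one is shifted down by one
    before being used as a list position.
    """
    tokens = line.split()
    label = float(tokens[0])
    vector = [0.0] * num_features
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text) - 1
        if not 0 <= index < num_features:
            raise ValueError("feature index out of range: " + index_text)
        vector[index] = float(value_text)
    return label, vector


# Third row from the issue description, written out in full ...
dense_line = "1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1"
# ... and the same row with the zero entries omitted, which the format allows.
sparse_line = "1 2:1 5:1 8:1 10:1"

parsed_dense = parse_libsvm_line(dense_line, num_features=10)
parsed_sparse = parse_libsvm_line(sparse_line, num_features=10)

# The same example built directly in code, with no text format in between.
direct = (1.0, [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0])

print(parsed_dense == direct, parsed_sparse == direct)
```

If the data already lives in your program, building the label and vector 
directly skips both the formatting and the parsing step.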

I think the docs do explain all this. Some of it is just how these formats and 
classifiers work, which won't all be explained again by Spark docs. Maybe 
there's an example to distill from this; this example is not correct yet.

> Provide sample format of training/test data in MLlib programming guide
> ----------------------------------------------------------------------
>
>                 Key: SPARK-4872
>                 URL: https://issues.apache.org/jira/browse/SPARK-4872
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.1.1
>            Reporter: zhang jun wei
>              Labels: documentation
>
> I suggest that the samples in the online MLlib programming guide use 
> real-life data and list the translated data format for the model to 
> consume. 
> The problem blocking me is how to translate real-life data into a format 
> that MLlib can understand correctly. 
> Here is one sample: I want to use NaiveBayes to train and predict a 
> tennis-play decision. The original data is:
> Weather | Temperature | Humidity | Wind => Decision to play tennis
> Sunny   | Hot         | High     | No   => No
> Sunny   | Hot         | High     | Yes  => No
> Cloudy  | Normal      | Normal   | No   => Yes
> Rainy   | Cold        | Normal   | Yes  => No
> Now, from my understanding, one potential translation is:
> 1) put every feature value word into a line:
> Sunny Cloudy Rainy Hot Normal Cold High Normal Yes No
> 2) map them to numbers:
> 1 2 3 4 5 6 7 8 9 10
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) set the value to 1 if it appears, or 0 if not; for the above example, 
> here is the data format for MLUtils.loadLibSVMFile to use:
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:0 10:1
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:1 10:0
> 1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1
> 0 1:0 2:0 3:1 4:0 5:0 6:1 7:0 8:1 9:1 10:0
> ==> Is this a correct understanding?
> Another way I can imagine is:
> 1) put every feature name into a line:
> Weather  Temperature  Humidity  Wind
> 2) map them to numbers:
> 1 2 3 4 
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) map each value of each feature to a number (e.g. Sunny to 1, Cloudy to 2, 
> Rainy to 3; Hot to 1, Normal to 2, Cold to 3; High to 1, Normal to 2; Yes to 
> 1, No to 2); for the above example, here is the data format for 
> MLUtils.loadLibSVMFile to use:
> 0 1:1 2:1 3:1 4:2
> 0 1:1 2:1 3:1 4:1
> 1 1:2 2:2 3:2 4:2
> 0 1:3 2:3 3:2 4:1
> ==> but when I read the source code in NaiveBayes.scala, it seems this is 
> not correct. I am not sure, though.
> So which way of translating the data is correct?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

