[ https://issues.apache.org/jira/browse/SPARK-16840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-16840:
---------------------------------
    Labels: ML MLLib, bulk-closed  (was: ML MLLib,)

> Please save the aggregate term frequencies as part of the NaiveBayesModel
> -------------------------------------------------------------------------
>
>                 Key: SPARK-16840
>                 URL: https://issues.apache.org/jira/browse/SPARK-16840
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Barry Becker
>            Priority: Major
>              Labels: ML, MLLib,, bulk-closed
>
> I would like to visualize the structure of the NaiveBayes model to gain
> additional insight into the patterns in the data. To do that, I need the
> frequencies of each feature value per label.
> This exact information is computed in the NaiveBayes.run method (see the
> "aggregated" variable), but it is discarded when the model is created. Pi and
> theta are computed from the aggregated frequency counts, but, surprisingly,
> the counts themselves are not needed to apply the model. Adding these
> aggregated counts would not increase the model size much, and they could be
> very useful for some applications of the model.
> {code}
>   def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
>      :
>     // Aggregates term frequencies per label.
>     val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
>       createCombiner = (v: Vector) => {
>         :
>       },
>     :
>     new NaiveBayesModel(labels, pi, theta, modelType) // <- please include "aggregated" here.
>   }
> {code}
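
To make the request above concrete, here is a minimal, dependency-free sketch of how per-label aggregated counts relate to pi and theta for a multinomial model with additive smoothing. It is not the Spark source: `Example`, `LabelCounts`, `NaiveBayesCounts`, `aggregate`, and `piAndTheta` are hypothetical names chosen for illustration, and plain Scala collections stand in for the RDD and `combineByKey` step. It assumes every feature vector has the same length.

```scala
// Hypothetical stand-in for a labeled example; not Spark's LabeledPoint.
final case class Example(label: Double, features: Array[Double])

// Per-label aggregate: number of examples and summed term frequencies.
// This is the kind of information the issue asks to keep in the model.
final case class LabelCounts(numDocs: Long, termFreqs: Array[Double])

object NaiveBayesCounts {

  // Groups examples by label and sums their feature vectors element-wise.
  def aggregate(data: Seq[Example]): Map[Double, LabelCounts] =
    data.groupBy(_.label).map { case (label, examples) =>
      val sums = examples.map(_.features).reduce { (a, b) =>
        a.zip(b).map { case (x, y) => x + y }
      }
      label -> LabelCounts(examples.size.toLong, sums)
    }

  // Derives sorted labels, log-priors (pi) and log-likelihoods (theta)
  // from the aggregated counts, using additive smoothing `lambda`.
  def piAndTheta(
      agg: Map[Double, LabelCounts],
      lambda: Double): (Array[Double], Array[Double], Array[Array[Double]]) = {
    val labels = agg.keys.toArray.sorted
    val numDocs = agg.values.map(_.numDocs).sum
    val numFeatures = agg.values.head.termFreqs.length
    val piLogDenom = math.log(numDocs + labels.length * lambda)
    val pi = labels.map(l => math.log(agg(l).numDocs + lambda) - piLogDenom)
    val theta = labels.map { l =>
      val tf = agg(l).termFreqs
      val thetaLogDenom = math.log(tf.sum + numFeatures * lambda)
      tf.map(c => math.log(c + lambda) - thetaLogDenom)
    }
    (labels, pi, theta)
  }
}
```

In this sketch the counts returned by `aggregate` are enough to recompute pi and theta, while the reverse is not true: once only the smoothed log values are kept, the raw per-label frequencies needed for visualization are gone.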



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
