[ https://issues.apache.org/jira/browse/SPARK-16840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-16840:
---------------------------------
Labels: ML MLLib, bulk-closed (was: ML MLLib,)
> Please save the aggregate term frequencies as part of the NaiveBayesModel
> -------------------------------------------------------------------------
>
> Key: SPARK-16840
> URL: https://issues.apache.org/jira/browse/SPARK-16840
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.6.2, 2.0.0
> Reporter: Barry Becker
> Priority: Major
> Labels: ML, MLLib,, bulk-closed
>
> I would like to visualize the structure of the NaiveBayes model in order to
> get additional insight into the patterns in the data. In order to do that I
> need the frequencies for each feature value per label.
> This exact information is computed in the NaiveBayes.run method (see
> "aggregated" variable), but then discarded when creating the model. Pi and
> theta are computed based on the aggregated frequency counts, but surprisingly
> those counts are not needed to apply the model. Storing these aggregated
> counts would not add much to the model size, and they could be very useful
> for some applications of the model.
> {code}
> def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
>   :
>   // Aggregates term frequencies per label.
>   val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
>     createCombiner = (v: Vector) => {
>       :
>     },
>     :
>   new NaiveBayesModel(labels, pi, theta, modelType) // <- please include "aggregated" here.
> }
> {code}
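
To make the request concrete, here is a minimal sketch of the idea in plain Scala collections. It is not Spark's implementation: `LabeledPoint` below is a hypothetical stand-in for Spark's class of the same name, and `LabelCounts`, `AggregatedCountsSketch`, `aggregate`, and `piAndTheta` are names invented for illustration. It assumes the usual multinomial estimate with additive smoothing, and shows that pi and theta can be derived from the per-label counts, while the counts cannot be recovered from pi and theta without the totals.

```scala
// Illustrative sketch only: plain Scala collections, not Spark's NaiveBayes code.
// LabeledPoint is a hypothetical stand-in for Spark's class of the same name.
final case class LabeledPoint(label: Double, features: Array[Double])

// Hypothetical container for what the issue asks the model to retain:
// the number of documents and the summed feature values for one label.
final case class LabelCounts(label: Double, numDocs: Long, featureSums: Array[Double])

object AggregatedCountsSketch {
  // Sums feature values per label, similar in spirit to the "aggregated" variable.
  // Assumes all feature arrays have the same length.
  def aggregate(data: Seq[LabeledPoint]): Seq[LabelCounts] =
    data.groupBy(_.label).toSeq.sortWith(_._1 < _._1).map { case (label, points) =>
      val sums = points.map(_.features).reduce { (a, b) =>
        a.zip(b).map { case (x, y) => x + y }
      }
      LabelCounts(label, points.size.toLong, sums)
    }

  // Derives log-priors (pi) and log-likelihoods (theta) from the counts,
  // assuming a multinomial model with additive smoothing `lambda`.
  def piAndTheta(
      counts: Seq[LabelCounts],
      lambda: Double = 1.0): (Array[Double], Array[Array[Double]]) = {
    val numLabels = counts.size
    val numFeatures = counts.head.featureSums.length
    val totalDocs = counts.map(_.numDocs).sum.toDouble
    val piLogDenom = math.log(totalDocs + numLabels * lambda)
    val pi = counts.map(c => math.log(c.numDocs + lambda) - piLogDenom).toArray
    val theta = counts.map { c =>
      val thetaLogDenom = math.log(c.featureSums.sum + numFeatures * lambda)
      c.featureSums.map(s => math.log(s + lambda) - thetaLogDenom)
    }.toArray
    (pi, theta)
  }
}
```

With something like `LabelCounts` kept alongside pi and theta, a visualization could read the raw per-label frequencies directly instead of trying to invert the smoothed log-probabilities.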