[ https://issues.apache.org/jira/browse/SPARK-23709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-23709.
-------------------------------
Resolution: Not A Problem
This is a question for the mailing list rather than JIRA at this stage.
> BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the
> sum of weights
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-23709
> URL: https://issues.apache.org/jira/browse/SPARK-23709
> Project: Spark
> Issue Type: Question
> Components: ML
> Affects Versions: 2.1.1
> Reporter: Olivier Sannier
> Priority: Critical
>
> With a bagging method such as RandomForest, the theory says that each tree is
> trained on its own resampled copy of the source dataset.
> To avoid excessive memory usage, Spark uses the BaggedPoint concept, where
> each row carries one weight per resampled dataset, i.e. one weight for each
> tree requested from the RandomForest.
> RandomForest requires that the dataset for each tree is a random draw with
> replacement from the source data and has the same size as the source data.
> However, during our investigations, we found that the count value used to
> compute the variance is not always equal to the source data count: it is
> sometimes lower, sometimes higher.
> I went digging in the source and found the
> BaggedPoint.convertToBaggedRDDSamplingWithReplacement method, which uses a
> Poisson distribution to assign a weight to each row. This distribution
> does not guarantee that the total of the weights for a given tree equals the
> source dataset count.
> Looking around here, it seems this is done for performance reasons, because
> the approximation it gives is good enough, especially when dealing with very
> large datasets.
> However, I could not find any documentation that clearly explains this. Would
> you have any link on the subject?
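
The behaviour described in the report can be reproduced with a small
standalone sketch. This is not Spark's code: the function names
(poisson_bagged_weights, exact_bootstrap_weights, tree_totals) are
hypothetical, and the only thing taken from the report is that each row gets
an independent Poisson-distributed weight per tree. Assuming a Poisson mean of
1 per row, the total weight for one tree over n rows is itself Poisson with
mean n and standard deviation sqrt(n), so it fluctuates around n rather than
equalling it. An exact bootstrap (n draws with replacement) always sums to n.

```python
import math
import random


def poisson_sample(rng, lam=1.0):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def poisson_bagged_weights(num_rows, num_trees, rate=1.0, seed=42):
    """Hypothetical stand-in for per-row, per-tree Poisson weighting.

    Returns a list of num_rows rows, each holding num_trees integer weights.
    Every weight is drawn independently, so no pass over the whole dataset
    is needed, but the per-tree totals are not constrained to num_rows.
    """
    rng = random.Random(seed)
    return [
        [poisson_sample(rng, rate) for _ in range(num_trees)]
        for _ in range(num_rows)
    ]


def exact_bootstrap_weights(num_rows, rng):
    """Classic bootstrap: num_rows draws with replacement, counted per row."""
    counts = [0] * num_rows
    for _ in range(num_rows):
        counts[rng.randrange(num_rows)] += 1
    return counts


def tree_totals(weights, num_trees):
    """Sum the weights assigned to each tree across all rows."""
    return [sum(row[t] for row in weights) for t in range(num_trees)]


if __name__ == "__main__":
    n, trees = 10_000, 5
    weights = poisson_bagged_weights(n, trees)
    # Totals land near n (typically within a few multiples of sqrt(n)).
    print("Poisson totals per tree:", tree_totals(weights, trees))
    # The exact bootstrap total is n by construction.
    print("Exact bootstrap total:", sum(exact_bootstrap_weights(n, random.Random(0))))
```

The relative deviation of the Poisson total shrinks like 1/sqrt(n), which is
consistent with the reporter's reading that the approximation is good enough
for very large datasets.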