Olivier Sannier created SPARK-23709:
---------------------------------------
Summary: BaggedPoint.convertToBaggedRDDSamplingWithReplacement
does not guarantee the sum of weights
Key: SPARK-23709
URL: https://issues.apache.org/jira/browse/SPARK-23709
Project: Spark
Issue Type: Question
Components: ML
Affects Versions: 2.1.1
Reporter: Olivier Sannier
When using a bagging method such as RandomForest, the theory dictates that the
source dataset is copied once per tree, each copy holding a subsample of the rows.
To avoid excessive memory usage, Spark uses the BaggedPoint concept instead, where
each row is associated with one weight per final dataset, i.e. one weight per
tree requested of the RandomForest.
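For illustration, a minimal sketch of that structure (the names are simplified
for this ticket, not Spark's exact API):

{code:scala}
// Instead of materialising numTrees copies of the dataset, each row is
// kept once and tagged with one weight per tree: weight k for tree t
// means "this row counts k times in tree t's training set".
case class BaggedPoint[Datum](datum: Datum, subsampleWeights: Array[Double])
{code}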
RandomForest requires that the dataset for each tree be a random draw with
replacement from the source data, of the same size as the source data.
However, during our investigation, we found that the count used to compute the
variance is not always equal to the source data count; it is sometimes less,
sometimes more.
Digging into the source, I found the
BaggedPoint.convertToBaggedRDDSamplingWithReplacement method, which uses a
Poisson distribution to assign a weight to each row. That distribution does not
guarantee that the weights for a given tree sum to the source dataset count.
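A small standalone sketch of why the totals drift, using commons-math3's
PoissonDistribution (which is on Spark's classpath): the weights for one tree
sum to n only in expectation, with a standard deviation of sqrt(n) around it.

{code:scala}
import org.apache.commons.math3.distribution.PoissonDistribution

val n = 100000
val poisson = new PoissonDistribution(1.0)   // mean 1, i.e. subsampling rate 1.0
poisson.reseedRandomGenerator(42L)

// One Poisson(1) weight per row, in the spirit of
// convertToBaggedRDDSamplingWithReplacement.
val weights = Seq.fill(n)(poisson.sample())
println(s"n = $n, sum of weights = ${weights.sum}")
// The sum has mean n but standard deviation sqrt(n) (~316 here), so a
// single draw almost never lands exactly on n.
{code}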
Looking around, this appears to be done for performance reasons: the
approximation it gives is considered good enough, especially when dealing with
very large datasets.
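The performance argument, as I understand it: an exact bootstrap draws n row
indices with replacement in one global operation, so each row's count is
Binomial(n, 1/n), which Poisson(1) approximates well for large n. The Poisson
version replaces that global draw with an independent per-row draw, which
parallelises trivially across partitions, at the cost of the total no longer
being exactly n. A sketch of the exact variant for contrast
(exactBootstrapCounts is a made-up name):

{code:scala}
import scala.util.Random

// Exact bootstrap: counts always sum to exactly n, but the draw is a
// single global loop over n indices, not an independent per-row draw.
def exactBootstrapCounts(n: Int, seed: Long): Array[Int] = {
  val rng = new Random(seed)
  val counts = new Array[Int](n)
  (1 to n).foreach(_ => counts(rng.nextInt(n)) += 1)
  counts
}

// exactBootstrapCounts(1000, 7L).sum == 1000, always.
{code}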
However, I could not find any documentation that clearly explains this. Could
you point me to a reference on the subject?