Olivier Sannier created SPARK-23709:
---------------------------------------

             Summary: BaggedPoint.convertToBaggedRDDSamplingWithReplacement 
does not guarantee the sum of weights
                 Key: SPARK-23709
                 URL: https://issues.apache.org/jira/browse/SPARK-23709
             Project: Spark
          Issue Type: Question
          Components: ML
    Affects Versions: 2.1.1
            Reporter: Olivier Sannier


When using a bagging method such as RandomForest, the theory dictates that each 
tree is trained on a copy of the source dataset containing a random subsample 
of its rows.

To avoid excessive memory usage, Spark uses the BaggedPoint concept: each row 
is associated with one weight per subsampled dataset, i.e. one per tree 
requested from the RandomForest.

RandomForest requires that the dataset for each tree be a random draw with 
replacement from the source data, with the same size as the source data.

However, during our investigation, we found that the count value used to 
compute the variance is not always equal to the source data count; it is 
sometimes less, sometimes more.

I went digging in the source and found the 
BaggedPoint.convertToBaggedRDDSamplingWithReplacement method, which uses a 
Poisson distribution to assign a weight to each row. This distribution does 
not guarantee that the weights for a given tree sum to the source dataset 
count.
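The effect is easy to reproduce outside Spark. Below is a minimal stdlib-only 
Python sketch (not Spark's actual Scala implementation; the poisson helper is 
a standalone re-implementation using Knuth's method): each row independently 
draws a Poisson(1.0) weight, so the per-tree weight totals fluctuate around 
the row count rather than matching it exactly.

```python
import math
import random

def poisson(lam: float, rng: random.Random) -> int:
    """Sample from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(42)
n = 1000  # number of rows in the "source dataset"

# One set of bagging weights per tree, mimicking sampling with replacement
# at a subsample rate of 1.0 (mean weight per row is 1).
weight_sums = [sum(poisson(1.0, rng) for _ in range(n)) for _ in range(5)]
print(weight_sums)  # totals near n, but generally not exactly n
```

An exact bootstrap (a multinomial draw of n rows) would make every total 
exactly n, but requires coordination across the whole partitioned dataset, 
whereas the independent Poisson draws need no communication between rows.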

From what I can gather, this is done for performance reasons, because the 
approximation is good enough, especially when dealing with very large 
datasets.

However, I could not find any documentation that clearly explains this choice. 
Could you point me to any references on the subject?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
