[ 
https://issues.apache.org/jira/browse/SPARK-31332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanley Poon updated SPARK-31332:
---------------------------------
    Description: 
h3. Background

The RandomForest model does not provide the proximity measure described by 
[Breiman|https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]. 
Proximity has many important use cases:
 - replacing missing data more accurately
 - identifying outliers
 - clustering or multi-dimensional scaling
 - computing the proximities of a test set against the training set
 - unsupervised learning

Performance and storage concerns are among the reasons that proximities are not 
computed and kept during prediction, as mentioned in 
[https://dzone.com/articles/classification-using-random-forest-with-spark-20].
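
For concreteness, Breiman defines the proximity of two samples as the fraction 
of trees in which they land in the same terminal node. A minimal sketch, with 
made-up leaf ids for illustration:
{code:scala}
// Proximity of two samples = fraction of the T trees in which they reach
// the same terminal (leaf) node. The leaf ids below are made up.
val leavesA = Array(3, 7, 7, 1, 4) // terminal node of sample A in each of T = 5 trees
val leavesB = Array(3, 7, 2, 1, 9) // terminal node of sample B in the same trees
val proximity =
  leavesA.zip(leavesB).count { case (a, b) => a == b }.toDouble / leavesA.length
// trees 1, 2, and 4 agree, so proximity = 3.0 / 5 = 0.6
{code}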
h3. Proposal

RF in Spark is optimized for massive scalability on large-scale datasets, where 
the numbers of data points, features, and trees can all be very large. Even with 
optimized N x T storage (rather than the full N x N proximity matrix), the data 
may still not fit in memory, where N is the number of data points and T is the 
number of trees in the forest.

We propose to add a column to the prediction output that returns the node id (or 
a hash of it) of the terminal node each sample data point reaches in each tree.
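
From the user's point of view nothing else would change. A sketch of the 
intended shape, where {{train}}, {{test}}, and the column name {{leafIds}} are 
assumptions, not part of any current API:
{code:scala}
import org.apache.spark.ml.classification.RandomForestClassifier

// Standard flow today; train and test are assumed DataFrames with the usual
// "label" and "features" columns.
val rf = new RandomForestClassifier().setNumTrees(100)
val model = rf.fit(train)
val predictions = model.transform(test)

// Proposed (hypothetical, not in Spark today): predictions would carry one
// extra array column, e.g. "leafIds", holding the terminal-node id that each
// row reaches in each of the T trees.
{code}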

The required changes to the current RF implementation would not increase 
computation or storage by a significant amount, and they would leave open the 
possibility of computing some form of proximity after prediction. How to use the 
extra column of node ids is up to the users; one possibility is sketched below. 
Without this column, there is currently no workaround for computing a proximity 
measure.
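
As one illustration of what users could do with the column, Breiman's proximity 
can be computed at scale by exploding the node ids and self-joining; 
{{predictions}}, {{id}}, and {{leafIds}} are the assumed names from the sketch 
above:
{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes an active SparkSession named spark

val numTrees = 100 // T, matching the model above

// One row per (sample, tree) with the terminal node reached; "id" and
// "leafIds" are the hypothetical columns from the sketch above.
val perTree = predictions.select($"id", posexplode($"leafIds").as(Seq("tree", "leaf")))

// proximity(i, j) = (# trees in which i and j share a terminal node) / T.
// Pairs that never share a node simply do not appear (proximity 0).
val proximities = perTree.as("a")
  .join(perTree.as("b"),
    $"a.tree" === $"b.tree" && $"a.leaf" === $"b.leaf" && $"a.id" < $"b.id")
  .groupBy($"a.id".as("i"), $"b.id".as("j"))
  .agg((count(lit(1)) / numTrees).as("proximity"))
{code}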
h4. Experiment on Spark 2.3.1 and 2.4.5

In one prototype, we output the terminal node id for each prediction from 
RandomForestClassificationModel and then used Spark's LSHModel to cluster the 
prediction results by terminal node ids. The performance of the whole pipeline 
was reasonable for the size of our dataset.
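
The description above does not spell out the encoding; a minimal sketch of one 
plausible wiring (all names illustrative) is to one-hot encode the (tree, leaf) 
pairs into a sparse binary vector and feed it to MinHashLSH. For one-leaf-per-tree 
sets, Jaccard similarity equals p / (2 - p), where p is the proximity, so 
neighbors under this metric are also proximity neighbors:
{code:scala}
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Encode each row's T terminal-node ids as a sparse binary vector with one
// "on" position per (tree, leaf) pair; maxLeafId is an assumed upper bound
// on per-tree leaf ids, and leafIds is the hypothetical column from above.
val maxLeafId = 1024
val toVector = udf { leafIds: Seq[Int] =>
  val indices = leafIds.zipWithIndex.map { case (leaf, tree) => tree * maxLeafId + leaf }
  Vectors.sparse(leafIds.length * maxLeafId, indices.toArray, Array.fill(indices.length)(1.0))
}
val encoded = predictions.withColumn("leafVec", toVector(col("leafIds")))

// MinHash approximates Jaccard distance: rows whose trees often agree on a
// terminal node end up close, so the join below groups high-proximity pairs.
val lsh = new MinHashLSH().setInputCol("leafVec").setOutputCol("hashes").setNumHashTables(5)
val lshModel = lsh.fit(encoded)
val similarPairs = lshModel.approxSimilarityJoin(encoded, encoded, 0.6, "jaccardDistance")
{code}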
h3. References
 * L. Breiman. Manual on setting up, using, and understanding random forests 
v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]
 * [https://dzone.com/articles/classification-using-random-forest-with-spark-20]


> Proposal to add Proximity Measure in Random Forest
> --------------------------------------------------
>
>                 Key: SPARK-31332
>                 URL: https://issues.apache.org/jira/browse/SPARK-31332
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.4.5
>         Environment: The proposal should apply to any Spark version and OS 
> supported by Spark.
> Specifically, the observations reported were based on:
>  * Spark 2.3.1 and 2.4.5
>  * Ubuntu 16.04.6 LTS
>  * Mac OS 10.13.6
>  
>            Reporter: Stanley Poon
>            Priority: Major
>              Labels: Proximity, RandomForest, ml
>


