Re: RandomForest caching

2017-05-12 Thread madhu phatak
Hi,
I have opened a JIRA issue:

https://issues.apache.org/jira/browse/SPARK-20723

Can someone take a look?

On Fri, Apr 28, 2017 at 1:34 PM, madhu phatak  wrote:

> Hi,
>
> I am testing RandomForestClassification with 50 GB of data cached in
> memory. I have 64 GB of RAM, of which 28 GB is used for caching the
> original dataset.
>
> When I run random forest, it caches around 300 GB of intermediate data,
> which uncaches the original dataset. This caching is triggered by the
> following code in RandomForest.scala:
>
> ```
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
>     numTrees, withReplacement, seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ```
>
> Since I have no control over this storage level, I cannot ensure the
> original dataset stays in memory for other interactive tasks while random
> forest is running.
>
> Would it be a good idea to make this storage level a user parameter? If so,
> I can open a JIRA issue and submit a PR for it.
>
> --
> Regards,
> Madhukara Phatak
> http://datamantra.io/
>



-- 
Regards,
Madhukara Phatak
http://datamantra.io/


RandomForest caching

2017-04-28 Thread madhu phatak
Hi,

I am testing RandomForestClassification with 50 GB of data cached in
memory. I have 64 GB of RAM, of which 28 GB is used for caching the
original dataset.

When I run random forest, it caches around 300 GB of intermediate data,
which uncaches the original dataset. This caching is triggered by the
following code in RandomForest.scala:

```
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
    numTrees, withReplacement, seed)
  .persist(StorageLevel.MEMORY_AND_DISK)
```

Since I have no control over this storage level, I cannot ensure the
original dataset stays in memory for other interactive tasks while random
forest is running.

Would it be a good idea to make this storage level a user parameter? If so,
I can open a JIRA issue and submit a PR for it.
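For illustration, the change could look something like the sketch below. The
parameter name `intermediateStorageLevel` is hypothetical, not an actual Spark
API; the only real pieces are `StorageLevel.fromString`, which Spark provides
for parsing level names like "MEMORY_ONLY", and the existing `convertToBaggedRDD`
call from RandomForest.scala:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical user-settable parameter (e.g. exposed as an ML Param or
// strategy field); defaults to the current hard-coded behavior.
val intermediateStorageLevel: String = "MEMORY_ONLY"

// Same call as today, but the persist level now comes from the parameter
// instead of being fixed to MEMORY_AND_DISK.
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
    numTrees, withReplacement, seed)
  .persist(StorageLevel.fromString(intermediateStorageLevel))
```

With something like this, a user could pick MEMORY_ONLY so the intermediate
bagged data never spills and evicts less aggressively, or DISK_ONLY to keep the
original dataset resident in memory.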

-- 
Regards,
Madhukara Phatak
http://datamantra.io/