Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
Can you try reducing maxBins? That reduces communication (at the cost of coarser discretization of continuous features).
On Fri, Apr 1, 2016 at 11:32 AM, Joseph Bradley wrote:
> In my experience, 20K is a lot but often doable; 2K is easy; 200 is
> small. Communication
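As a rough illustration of the maxBins suggestion, the sketch below estimates how much split-statistics data a histogram-based tree learner has to aggregate for one group of tree nodes. It is a back-of-envelope model, not MLlib's actual code: the function name `histogram_bytes`, the one-counter-per-(node, feature, bin, class) layout, the 8-byte value size, and the binary-label assumption are all illustrative. Random forests usually consider only a subset of features per node, so treat the result as an upper bound.

```python
def histogram_bytes(num_features, max_bins, num_classes,
                    nodes_per_group=1, bytes_per_value=8):
    """Approximate bytes of split statistics aggregated for one group of nodes.

    Assumption (hypothetical layout): one counter per
    (node, feature, bin, class), each stored as an 8-byte double.
    """
    return (nodes_per_group * num_features * max_bins
            * num_classes * bytes_per_value)


if __name__ == "__main__":
    # Feature counts taken from the thread; 2 classes is an assumption.
    for features in (200, 2_000, 20_000, 70_000):
        for bins in (32, 16):  # 32 is Spark's default maxBins
            size_mb = histogram_bytes(features, bins, num_classes=2) / 1e6
            print(f"{features:>6} features, maxBins={bins:>2}: ~{size_mb:.2f} MB per node")
```

Under this model the estimate grows linearly in both the number of features and maxBins, so halving maxBins halves the data aggregated per node at a fixed feature count.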

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
In my experience, 20K is a lot but often doable; 2K is easy; 200 is small. Communication scales linearly in the number of features.
On Thu, Mar 31, 2016 at 6:12 AM, Eugene Morozov wrote:
> Joseph,
>
> Correction, there are 20k features. Is it still a lot?
> What number

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-31 Thread Eugene Morozov
Joseph,
Correction: there are 20k features. Is that still a lot? What number of features can be considered normal?
--
Be well!
Jean Morozov
On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley wrote:
> First thought: 70K features is *a lot* for the MLlib implementation (and
>

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-30 Thread Eugene Morozov
One more thing. With the increased stack size it has already completed two more times, but now I see this in the log:
[dispatcher-event-loop-1] WARN o.a.spark.scheduler.TaskSetManager - Stage 24860 contains a task of very large size (157 KB). The maximum recommended task size is 100 KB.
Size of the task

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
Joseph,
I'm using 1.6.0.
--
Be well!
Jean Morozov
On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley wrote:
> First thought: 70K features is *a lot* for the MLlib implementation (and
> any PLANET-like implementation)
>
> Using fewer partitions is a good idea.
>
> Which

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Joseph Bradley
First thought: 70K features is *a lot* for the MLlib implementation (and any PLANET-like implementation).
Using fewer partitions is a good idea.
Which Spark version was this on?
On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov wrote:
> The questions I have in mind:
>

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
The questions I have in mind:
Is this something one might expect? From the stack trace itself it's not clear where it comes from.
Is it an already known bug? I haven't found anything like that.
Is it possible to configure something to work around or avoid this?
I'm not sure it's the

SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
Hi,
I have a web service that provides a REST API for training a random forest. I train the random forest on a 5-node Spark cluster with enough memory; everything is cached (~22 GB). On small datasets of up to 100k samples everything is fine, but with the biggest one (400k samples and ~70k features)