Re: Random Forest hangs without trace of error

2017-05-30 Thread Morten Hornbech



Re: Random Forest hangs without trace of error

2017-05-30 Thread Sumona Routh
Hi Morten,
Were you able to resolve your issue with RandomForest? I am having similar
issues with a newly trained model (one that has a larger number of trees and
a smaller minInstancesPerNode, which is by design, to produce the
best-performing model).
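
For reference, a minimal sketch of that kind of configuration (spark.ml API;
the exact values here are hypothetical):

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setNumTrees(200)            // deliberately many trees
      .setMinInstancesPerNode(1)   // deliberately small leaves
      .setLabelCol("label")
      .setFeaturesCol("features")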

I wanted to get some feedback on how you solved your issue before I post a
separate question.

Thanks!
Sumona


Re: Random Forest hangs without trace of error

2016-12-11 Thread Marco Mistroni
OK. Did you change Spark version? Java/Scala/Python version?
Have you tried different versions of any of the above?
Hope this helps
Kr


Re: Random Forest hangs without trace of error

2016-12-10 Thread Morten Hornbech
I haven’t actually experienced any non-determinism. We have nightly integration 
tests comparing output from random forests with no variations.

The workaround we will probably try is to split the dataset, either randomly or 
on one of the variables, and then train a forest on each partition, which 
should then be sufficiently small.
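
A sketch of that workaround, assuming the spark.ml API ("dataset", the split
weights and the column names are placeholders):

    import org.apache.spark.ml.classification.RandomForestClassifier

    // Four roughly equal random partitions; train one forest per partition
    val parts = dataset.randomSplit(Array(0.25, 0.25, 0.25, 0.25), seed = 42L)
    val models = parts.map { part =>
      new RandomForestClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .fit(part)
    }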

I hope to be able to provide a good repro case in a few weeks. If the
problem turns out to be in our own code, I will also post that in this
thread.

Morten



Re: Random Forest hangs without trace of error

2016-12-10 Thread Marco Mistroni
Hello Morten,
OK.
AFAIK there is a tiny bit of randomness in these ML algorithms (please,
anyone correct me if I'm wrong).
In fact, if you run your RDF code multiple times it will not give you
EXACTLY the same results (though accuracy and errors should be more or
less similar)... at least this is what I found when playing around with
RDF, decision trees and other ML algorithms.

If RDF is not a must for your use case, could you try scaling back to
Decision Trees and see if you still get intermittent failures?
That would at least exclude issues with the data.
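
A sketch of that swap, assuming the spark.ml API, an assembled "features"
column and a placeholder trainingData frame:

    import org.apache.spark.ml.classification.DecisionTreeClassifier

    // A single tree instead of a forest, trained on the same data
    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxDepth(5)
      .setMaxBins(32)
    val dtModel = dt.fit(trainingData)  // does the hang still occur?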

hth
 marco

On Sat, Dec 10, 2016 at 5:20 PM, Morten Hornbech wrote:

> Already did. There are no issues with smaller samples. I am running this
> in a cluster of three t2.large instances on aws.
>
> I have tried to find the threshold where the error occurs, but it is not a
> single factor causing it. Input size and subsampling rate seem to be the
> most significant, and the number of trees the least.
>
> I have also tried running on a test frame of randomized numbers with the
> same number of rows, and could not reproduce the problem here.
>
> By the way, maxDepth is 5 and maxBins is 32.
>
> I will probably need to leave this for a few weeks to focus on more
> short-term stuff, but I will write here if I solve it or reproduce it more
> consistently.
>
> Morten
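
For reference, a sketch of the randomized control test described above
(assuming a SparkSession named spark; the schema and feature count are
made up, only the row count matches):

    import org.apache.spark.sql.functions.rand
    import org.apache.spark.ml.feature.VectorAssembler

    // 600k rows of uniform noise, assembled into a "features" vector column
    val noise = spark.range(600000L)
      .withColumn("f1", rand(1))
      .withColumn("f2", rand(2))
      .withColumn("f3", rand(3))
      .withColumn("label", (rand(4) > 0.5).cast("double"))
    val testFrame = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")
      .transform(noise)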
>


Re: Random Forest hangs without trace of error

2016-12-10 Thread Marco Mistroni
Hi
Bring back samples to the 1k range to debug... or, as suggested, reduce
trees and bins. I had RDD code running on same-size data with no issues...
Or send me some sample code and data and I'll try it out on my EC2
instance...
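
For example, something like this (a sketch — df stands for the input
DataFrame):

    // Sample down to roughly 1k of the ~600k rows, or just take the head
    val small = df.sample(withReplacement = false,
                          fraction = 1000.0 / 600000.0, seed = 42L)
    val first1k = df.limit(1000)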
Kr



Re: Random Forest hangs without trace of error

2016-12-09 Thread Md. Rezaul Karim
I had a similar experience last week. I could not find any error trace
either.

Later on, I did the following to get rid of the problem:
i) I downgraded to Spark 2.0.0
ii) Decreased the values of maxBins and maxDepth

Additionally, make sure that you set the featureSubsetStrategy to "auto" to
let the algorithm choose the best feature subset strategy for your data.
Finally, set the impurity to "gini" as the information gain criterion.

However, setting the number of trees to just 1 gives you neither the real
advantage of a forest nor better predictive performance.
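
Putting those settings together, a minimal sketch (spark.ml API; the exact
values shown are illustrative):

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setMaxBins(16)                    // decreased from the default of 32
      .setMaxDepth(4)                    // decreased from the default of 5
      .setFeatureSubsetStrategy("auto")  // let Spark choose the strategy
      .setImpurity("gini")               // information gain criterion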



Best,
Karim



Random Forest hangs without trace of error

2016-12-09 Thread mhornbech
Hi

I have spent quite some time trying to debug an issue with the Random Forest
algorithm on Spark 2.0.2. The input dataset is relatively large at around
600k rows and 200MB, but I use subsampling to make each tree manageable.
However, even with only 1 tree and a low sample rate of 0.05, the job hangs
at one of the final stages (see attached). I have checked the logs on all
executors and the driver and found no trace of an error. Could it be a
memory issue even though no error appears? The error does seem somewhat
sporadic, so I also wondered whether it could be a data issue that only
occurs when the subsample includes bad data rows.
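
For reference, a minimal sketch of the setup described (spark.ml API on
Spark 2.0.2; column names and the trainingData frame are placeholders):

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(1)              // even a single tree shows the hang
      .setSubsamplingRate(0.05)    // low per-tree sample rate
      .setSeed(42L)
    val model = rf.fit(trainingData)  // hangs at one of the final stages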

Please comment if you have a clue.

Morten

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n28192/Sk%C3%A6rmbillede_2016-12-10_kl.png>
 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-hangs-without-trace-of-error-tp28192.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org