Github user bien commented on the pull request:
https://github.com/apache/spark/pull/5351#issuecomment-89713636
The behavior I was seeing was that RandomForest training tasks were spending
~90% of their time in GC, and when I turned on verbose GC logging I saw that
most of that time was spent (fruitlessly) collecting old-generation objects. I
assumed the baggedInput RDD was the culprit, because there were no other RDDs in
my code (other than the original input), and this patch did help somewhat.
Under these circumstances I don't mind spending time deserializing objects or
creating objects in the young generation.
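For reference, here's a minimal, self-contained sketch of the kind of change in
question: persisting an intermediate RDD in serialized form so each partition
is stored as one byte array instead of many small long-lived objects. The
stand-in data and object name are illustrative, not the actual RandomForest
internals. (Verbose GC logging of the sort mentioned above can be turned on
with `spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails`.)

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SerializedCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SerializedCacheSketch"))

    // Stand-in for the bagged training data; in RandomForest this would be
    // the baggedInput RDD derived from the caller's input RDD.
    val baggedInput = sc.parallelize(1 to 1000000)
      .map(i => (i, Array.fill(8)(i.toDouble)))

    // Serialized caching trades deserialization CPU for far less GC
    // pressure: one byte array per partition replaces many small objects
    // that would otherwise accumulate in the old generation.
    baggedInput.persist(StorageLevel.MEMORY_AND_DISK_SER)
    baggedInput.count() // materialize the cache

    sc.stop()
  }
}
```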
> An explicit parameter with a reasonable default might be better than making users persist RDDs as a way of specifying the parameter
This sounds fine to me, but I don't know the Spark codebase well enough to
contribute it myself.
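For illustration only, here's one hypothetical shape such a parameter could
take. The `TrainingConfig` class, `cacheStorageLevel` field, and
`withCachedBaggedInput` helper are assumptions made for this sketch, not
Spark's actual API:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical: let callers choose how the intermediate bagged RDD is
// cached, defaulting to serialized storage to keep GC pressure low.
case class TrainingConfig(
    cacheStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER)

// Persist the intermediate RDD at the configured level for the duration
// of training, then release it.
def withCachedBaggedInput[T, R](input: RDD[T], config: TrainingConfig)(
    train: RDD[T] => R): R = {
  val bagged = input.persist(config.cacheStorageLevel)
  try train(bagged)
  finally bagged.unpersist()
}
```

The idea is that the default keeps GC pressure low, while users who would
rather avoid deserialization overhead can pass
`StorageLevel.MEMORY_AND_DISK` instead.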