Github user bien commented on the pull request:
https://github.com/apache/spark/pull/5351#issuecomment-89713636
The behavior I was seeing was that RandomForest training tasks were spending
~90% of their time in GC, and when I turned on verbose GC logging I saw that
most of that time was spent (fruitlessly) collecting old-generation objects. I
assumed the baggedInput RDD was the culprit, because there were no other RDDs in
my code (other than the original input), and this patch did help somewhat.
Under these circumstances I don't mind spending time deserializing objects or
creating objects in the young generation.
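For reference, here's a minimal, self-contained sketch of the kind of change in
question: persisting an intermediate RDD in serialized form so each partition
is stored as one byte array instead of many small long-lived objects. The
stand-in data and object name are illustrative, not the actual RandomForest
internals. (Verbose GC logging of the sort mentioned above can be turned on
with `spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails`.)

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SerializedCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SerializedCacheSketch"))

    // Stand-in for the bagged training data; in RandomForest this would be
    // the baggedInput RDD derived from the caller's input RDD.
    val baggedInput = sc.parallelize(1 to 1000000)
      .map(i => (i, Array.fill(8)(i.toDouble)))

    // Serialized caching trades deserialization CPU for far less GC
    // pressure: one byte array per partition replaces many small objects
    // that would otherwise accumulate in the old generation.
    baggedInput.persist(StorageLevel.MEMORY_AND_DISK_SER)
    baggedInput.count() // materialize the cache

    sc.stop()
  }
}
```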
> An explicit parameter with a reasonable default might be better than making users persist RDDs as a way of specifying the parameter
This sounds fine to me, but I don't know the Spark codebase well enough to
contribute it myself.
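For illustration only, here's one hypothetical shape such a parameter could
take. The `TrainingConfig` class, `cacheStorageLevel` field, and
`withCachedBaggedInput` helper are assumptions made for this sketch, not
Spark's actual API:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical: let callers choose how the intermediate bagged RDD is
// cached, defaulting to serialized storage to keep GC pressure low.
case class TrainingConfig(
    cacheStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER)

// Persist the intermediate RDD at the configured level for the duration
// of training, then release it.
def withCachedBaggedInput[T, R](input: RDD[T], config: TrainingConfig)(
    train: RDD[T] => R): R = {
  val bagged = input.persist(config.cacheStorageLevel)
  try train(bagged)
  finally bagged.unpersist()
}
```

The idea is that the default keeps GC pressure low, while users who would
rather avoid deserialization overhead can pass
`StorageLevel.MEMORY_AND_DISK` instead.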