Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5351#issuecomment-89378788
  
    Let me make sure I understand.  If baggedInput is persisted serialized, 
then I agree it would take less memory/disk space.  However, wouldn't it get 
deserialized on every iteration, creating lots of new objects on each 
iteration?  If you're seeing GC problems, are you sure it's from baggedInput?
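The deserialization cost I'm worried about can be sketched outside Spark. This is a minimal Python analogy using plain `pickle` (not Spark's serialized StorageLevel, just the same idea): a serialized cache is compact, but every pass over it materializes fresh objects, which is exactly what feeds the garbage collector.

```python
import pickle

# A cached dataset, "persisted" in serialized form: compact, one byte blob.
data = [list(range(100)) for _ in range(100)]
blob = pickle.dumps(data)

# Each "iteration" over a serialized cache must deserialize afresh,
# allocating a brand-new object graph every time.
pass1 = pickle.loads(blob)
pass2 = pickle.loads(blob)

assert pass1 == data and pass2 == data   # same values...
assert pass1 is not data
assert pass1 is not pass2                # ...but new objects each pass -> GC pressure
```

A deserialized cache pays the memory cost once and reuses the same objects on every iteration, which is the trade-off in question here.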
    
    Stepping back, deciding what to persist is tough in MLlib since it's hard to 
know what the user would want.  I'd be on board with providing parameters that 
let experts set persistence levels for algorithm internals.  An explicit 
parameter with a reasonable default might be better than making users persist 
RDDs as a way of specifying that choice.
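As a rough sketch of what such an expert parameter could look like (the class and parameter names here are hypothetical, not actual MLlib API):

```python
from dataclasses import dataclass

# Hypothetical sketch of the "explicit parameter with a reasonable default"
# idea; `intermediate_storage_level` is an illustrative name, not MLlib API.
@dataclass
class TreeEnsembleParams:
    num_trees: int = 100
    # Experts can override how algorithm internals (e.g. baggedInput) are
    # persisted; the default is chosen to suit typical workloads.
    intermediate_storage_level: str = "MEMORY_AND_DISK"

# Non-expert users never have to think about it:
defaults = TreeEnsembleParams()
assert defaults.intermediate_storage_level == "MEMORY_AND_DISK"

# Experts opt in to serialized caching explicitly:
tuned = TreeEnsembleParams(intermediate_storage_level="MEMORY_ONLY_SER")
assert tuned.intermediate_storage_level == "MEMORY_ONLY_SER"
```

The point being that the persistence choice becomes a visible, documented knob rather than something inferred from whether the user happened to persist an input RDD.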

