Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54255291
  
    @mengxr  I agree memory limits are a problem.
    
    1.  I am OK with changing maxBins to 32.  For maxMemoryInMB, does 256 seem 
reasonable when the default executor memory is 512 MB?  (See the rough sizing 
sketch after this list.)
    
    2. For choosing the best splits, I agree we could distribute it, but I'm 
not sure about the gains.  The memory requirements might be hard to reduce 
given the current setup, since each executor needs to update bins for all nodes 
during its one pass over its data points.  If we maintained a mapping from 
nodes to relevant examples, then I could imagine spawning one job per node; 
that would reduce the memory requirement but might mean many more passes over 
the data.  As for the driver's load, fairly little time is spent outside of 
aggregation, so I don't think it's a big issue.  Am I misunderstanding, though?
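
    For context, here is a rough back-of-envelope sketch (Scala) of the 
bin-statistics buffer each executor holds during a pass.  The constants below 
(3 statistics per bin, 8 bytes each) and the example sizes are illustrative 
assumptions, not the exact layout in the current code:

```scala
// Rough, illustrative estimate of the per-executor aggregate buffer:
// one set of bin statistics per (node, feature, bin).
object AggSizeSketch {
  def aggregateBytes(numNodes: Int, numFeatures: Int, maxBins: Int,
                     statsSize: Int = 3): Long =
    numNodes.toLong * numFeatures * maxBins * statsSize * 8L  // 8 bytes per Double

  def main(args: Array[String]): Unit = {
    // E.g., 128 nodes, 1000 features, 32 bins:
    val bytes = aggregateBytes(numNodes = 128, numFeatures = 1000, maxBins = 32)
    println(f"~${bytes / (1024.0 * 1024.0)}%.1f MB per executor")  // ~93.8 MB
  }
}
```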
    
    I could imagine 2 main ways to reduce memory usage without doing more 
passes over the data:
    (a) Simple way: We could use smaller data types to compress the binned 
data, as others have done (sketched below).
    (b) Fancy way: We could use a much smaller maxBins by default but re-bin 
features at each node to ameliorate the effects of a small number of bins (see 
the second sketch below).  E.g., the top node might split a continuous feature 
into bins [0, 20], [20, 40], ... and choose to split at 20; in the next level, 
the left node might use bins [0, 5], [5, 10], ..., [15, 20], and the right node 
might use bins [20, 30], [30, 40], ....
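
    A minimal sketch of (a), assuming the binned feature values are currently 
stored as Ints; the CompactTreePoint name and helpers are hypothetical:

```scala
// Sketch of (a): if maxBins <= 256, a binned feature value fits in a Byte
// instead of an Int, cutting the binned-data footprint by 4x.
case class CompactTreePoint(label: Double, binnedFeatures: Array[Byte])

def compress(label: Double, binIndices: Array[Int], maxBins: Int): CompactTreePoint = {
  require(maxBins <= 256, "bin indices must fit in a Byte")
  // Shift by 128 so indices 0..255 map into Byte's range -128..127.
  CompactTreePoint(label, binIndices.map(i => (i - 128).toByte))
}

def binIndex(point: CompactTreePoint, feature: Int): Int =
  point.binnedFeatures(feature) + 128  // recover the original 0..255 index
```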
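
    And a minimal sketch of the per-node re-binning in (b), using equal-width 
bins restricted to the value range that can reach each node (the helper and 
the numbers are just for illustration):

```scala
// Sketch of (b): re-bin a continuous feature at each node, so a small
// maxBins still gives fine resolution near the chosen splits.
def rebin(lo: Double, hi: Double, maxBins: Int): Array[(Double, Double)] = {
  val width = (hi - lo) / maxBins
  Array.tabulate(maxBins)(i => (lo + i * width, lo + (i + 1) * width))
}

// Root covers [0, 80] with 4 bins: [0,20], [20,40], [40,60], [60,80].
val rootBins = rebin(0.0, 80.0, 4)
// Suppose the root splits this feature at 20.0:
val leftChildBins  = rebin(0.0, 20.0, 4)   // [0,5], [5,10], [10,15], [15,20]
val rightChildBins = rebin(20.0, 80.0, 4)  // [20,35], [35,50], [50,65], [65,80]
```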

