Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54252929
  
    @jkbradley I'm a little worried about the memory usage, but this is not a regression caused by this PR (ignoring the bug I commented on inline). I'm testing it on MNIST digits, which has `10` classes and `28 * 28 = 784` continuous features. The default `maxBins` is `100`, so the total stats array size is `10 * 100 * 784 = 784000` entries per node, i.e., about 6MB (see the first sketch after the questions below). Under the default setting (128MB), we can only train 16 nodes per group. Two questions:
    
    1. Is the default `maxBins = 100` too high and the default `maxMemoryInMB = 128` too low? If we change `maxBins` to `32` and `maxMemoryInMB` to `256`, we can train ~128 nodes in one pass.
    
    2. Can we compute the best splits distributively? We can join on the node index, send all stats associated with a single node to a single executor, and find the best split there (see the second sketch below). That would reduce the memory requirement as well as the driver's load.
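
    A quick sanity check of the arithmetic above (my own sketch, not part of this PR; the 8-byte `Double` entry size and the naive nodes-per-group division are assumptions, and the trainer's actual packing and overhead may differ):

```scala
object StatsSizeCheck {
  def main(args: Array[String]): Unit = {
    val numClasses = 10
    val numFeatures = 28 * 28   // 784 continuous features (MNIST)
    val maxBins = 100           // current default
    val bytesPerEntry = 8       // assuming Double-typed stats entries

    val entriesPerNode = numClasses.toLong * maxBins * numFeatures
    val mbPerNode = entriesPerNode * bytesPerEntry / (1024.0 * 1024.0)
    // Rough upper bound; ignores any per-node bookkeeping overhead,
    // so the trainer may schedule fewer nodes per group in practice.
    val nodesPerGroup = (128 / mbPerNode).toInt

    println(f"$entriesPerNode entries/node, ~$mbPerNode%.1f MB/node, <= $nodesPerGroup nodes/group")
    // With maxBins = 32 and maxMemoryInMB = 256, the same arithmetic
    // gives ~2 MB/node and ~128 nodes per group.
  }
}
```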
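
    For question 2, a minimal sketch of what the distributed reduction could look like (all types and names here are hypothetical, not this PR's API; it assumes each partition emits partial per-(node, feature) stats with a fixed array layout):

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Hypothetical containers: aggregated label stats for one (node, feature)
// pair, and a candidate split with its information gain.
case class FeatureStats(nodeIndex: Int, featureIndex: Int, stats: Array[Double])
case class Split(nodeIndex: Int, featureIndex: Int, threshold: Double, gain: Double)

// Merge partial stats on the executors instead of the driver, then pick
// each node's best split with a distributed reduce. Only one small Split
// per node is collected back to the driver.
def bestSplits(partialStats: RDD[FeatureStats],
               findBestSplit: FeatureStats => Split): Map[Int, Split] = {
  partialStats
    .map(s => ((s.nodeIndex, s.featureIndex), s))
    .reduceByKey { (a, b) =>
      // element-wise merge; both arrays share the same layout
      a.copy(stats = a.stats.zip(b.stats).map { case (x, y) => x + y })
    }
    .map { case (_, s) => (s.nodeIndex, findBestSplit(s)) }
    .reduceByKey((s1, s2) => if (s1.gain >= s2.gain) s1 else s2)
    .collect()
    .toMap
}
```

    The extra shuffle adds a per-iteration cost, but the driver-side memory then scales with the number of nodes (one small `Split` each) rather than with the full stats arrays.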


