Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/2125#issuecomment-54252929
@jkbradley I'm a little worried about the memory usage, but this is not a
regression caused by this PR (ignoring the bug I commented on inline). I'm
testing it on MNIST digits, which has `10` classes and `28 * 28 = 784`
continuous features. The default `maxBins` is `100`, so the total stats array
size is `10 * 100 * 784 = 784000` values per node, i.e., about 6MB at 8 bytes
per value. Under the default `maxMemoryInMB = 128`, we can only train 16 nodes
per group.
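For reference, here is the per-node arithmetic as a small, self-contained
Scala sketch; the names are illustrative only and nothing below is MLlib API:

```scala
// Back-of-envelope size of the per-node statistics aggregate
// (numbers mirror the MNIST example above).
object StatsMemory {
  def main(args: Array[String]): Unit = {
    val numClasses  = 10
    val maxBins     = 100
    val numFeatures = 28 * 28                              // 784 continuous features
    val valuesPerNode = numClasses * maxBins * numFeatures // 784,000
    val bytesPerNode  = valuesPerNode * 8L                 // stats are doubles, 8 bytes each
    println(f"per-node stats: ${bytesPerNode / 1024.0 / 1024.0}%.1f MB") // ~6.0 MB
  }
}
```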
Two questions:
1. Is the default `maxBins = 100` too high and the default `maxMemoryInMB =
128` too low? If we change `maxBins` to `32` and `maxMemoryInMB` to `256`, we
can train ~128 nodes in one pass (see the configuration sketch after this
list).
2. Can we compute the best splits distributively? We can join on the node
index so that all stats associated with a single node land on a single
executor, and find the best split there. That would reduce the memory
requirement as well as the driver's load (a rough RDD sketch follows the
list).
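On question 1, a minimal sketch of the proposed defaults, assuming the public
`Strategy` constructor in `org.apache.spark.mllib.tree.configuration`
(parameter names may differ across versions, and `trainingData` is a
placeholder):

```scala
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Gini

val strategy = new Strategy(
  algo = Algo.Classification,
  impurity = Gini,
  maxDepth = 10,
  numClassesForClassification = 10,
  maxBins = 32,        // down from 100: 10 * 32 * 784 = 250,880 doubles, ~2MB per node
  maxMemoryInMB = 256) // up from 128: 256MB / 2MB gives ~128 nodes per group
// val model = DecisionTree.train(trainingData, strategy)
```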
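On question 2, a rough sketch of the distributed best-split idea; the types
are illustrative, not MLlib internals, and the impurity/gain computation is
elided behind a hypothetical `bestSplit`:

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark 1.x)
import org.apache.spark.rdd.RDD

// Hypothetical result type; `bestSplit` stands in for the gain logic.
case class Split(feature: Int, threshold: Double, gain: Double)
def bestSplit(stats: Array[Double]): Split = ???

// `partials`: per-partition (nodeIndex, flattened feature x bin x class counts).
def bestSplitsDistributed(partials: RDD[(Int, Array[Double])]): Map[Int, Split] = {
  partials
    // Merge the partial aggregates for each node on one executor,
    // instead of shipping all of them to the driver.
    .reduceByKey { (a, b) =>
      var i = 0
      while (i < a.length) { a(i) += b(i); i += 1 }
      a
    }
    // Each executor picks the best split for the nodes it holds ...
    .mapValues(bestSplit)
    // ... and only the small (nodeIndex, Split) pairs return to the driver.
    .collect()
    .toMap
}
```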