Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2125#issuecomment-54255291
@mengxr I agree memory limits are a problem.
1. I am OK with changing maxBins to 32. For maxMemoryInMB, does 256 seem
reasonable when the default executor memory is 512 MB?
2. For choosing the best splits, I agree we could distribute it, but I'm
not sure about the gains. The memory requirements might be hard to reduce
given the current setup, since each executor needs to update bins for all
nodes during its one pass over its data points (a rough size estimate is
sketched below). If we maintained a mapping from nodes to relevant examples,
then I could imagine spawning one job per node; that would reduce the memory
requirement but might mean many more passes over the data. As for the
driver's load, fairly little time is spent outside of aggregation, so I
don't think it's a big issue. Am I misunderstanding, though?
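To make the memory pressure above concrete, here is a rough back-of-the-envelope sketch (not the code in this PR; the object/method names and the stats-per-bin factor are illustrative assumptions) of how the per-pass bin-statistics aggregate scales with the number of nodes trained per pass, and how a maxMemoryInMB budget bounds it:

```scala
// Rough sketch only: estimate the per-node bin-statistics aggregate size and
// how many nodes fit under a given maxMemoryInMB. statsPerBin and all names
// here are assumptions for illustration, not MLlib internals.
object BinAggregateSizeSketch {
  // Each node's aggregate holds statsPerBin doubles (8 bytes each) per (feature, bin).
  def aggregateBytesPerNode(numFeatures: Int, maxBins: Int, statsPerBin: Int): Long =
    numFeatures.toLong * maxBins * statsPerBin * 8L

  // How many nodes could be processed in one pass without exceeding the memory budget.
  def maxNodesPerGroup(maxMemoryInMB: Int, bytesPerNode: Long): Int =
    math.max(1, (maxMemoryInMB.toLong * 1024 * 1024 / bytesPerNode).toInt)

  def main(args: Array[String]): Unit = {
    val perNode = aggregateBytesPerNode(numFeatures = 1000, maxBins = 32, statsPerBin = 3)
    println(s"~${perNode / 1024} KB per node; " +
      s"${maxNodesPerGroup(maxMemoryInMB = 256, bytesPerNode = perNode)} nodes fit in 256 MB")
  }
}
```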
I could imagine 2 main ways to reduce memory usage without doing more
passes over the data:
(a) Simple way: We could compress the data by using smaller types, as
others have done.
(b) Fancy way: We could use a much smaller maxBins by default but re-bin
features at each node to ameliorate the effects of a small number of bins.
E.g., the top node might split a continuous feature into bins [0, 20], [20,
40], ... and choose to split at 20; in the next level, the left node might use
bins [0, 5], [5, 10], ..., [15, 20], and the right node might use bins [20, 30],
[30, 40], .... (A small sketch of this re-binning follows.)
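To illustrate (b), here is a minimal sketch, assuming simple equal-width re-binning over each child node's value range; the real split-finding logic (quantile-based bins, categorical features, unbounded ranges) would be more involved, and all names here are hypothetical:

```scala
// Minimal sketch of per-node re-binning for a continuous feature.
def rebin(lower: Double, upper: Double, numBins: Int): Array[Double] = {
  // Candidate split thresholds strictly inside [lower, upper].
  val width = (upper - lower) / numBins
  Array.tabulate(numBins - 1)(i => lower + width * (i + 1))
}

// Parent binned the feature on [0, 40] and split at 20; each child re-bins
// only its own sub-range, recovering resolution despite a small maxBins.
val leftChildThresholds  = rebin(0.0, 20.0, 4)  // Array(5.0, 10.0, 15.0) -> bins [0,5], [5,10], [10,15], [15,20]
val rightChildThresholds = rebin(20.0, 40.0, 2) // Array(30.0)            -> bins [20,30], [30,40]
```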