Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/886#issuecomment-44381347
I don't have a reference, and I did look for one. I am sure it is not
optimal, and not even that great as a greedy algorithm. Two low-entropy
distributions over target values could be high-entropy when combined.
You could pick the one feature value that makes the target's entropy lowest,
then pick the next value that makes the combined split's entropy lowest, and
so on. That amounts to evaluating on the order of n^2 subsets instead of 2^n.
If the alternative is to fail, or spend years in computation, I think
heuristics of some kind are a must. Even random selection of subsets is better
than rejecting the problem entirely -- anything is better than that I think.
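The greedy procedure above can be sketched roughly as follows (this is just an illustrative sketch, not the Spark implementation; the function names and the `(feature_value, label)` data layout are my own for the example). Each step adds the one remaining category that most lowers the weighted entropy of the resulting binary split, so we do at most n passes of at most n candidate evaluations each:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a multiset of target labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_split_entropy(data, left_cats):
    """Weighted entropy of the binary split sending left_cats to one side."""
    left = [y for v, y in data if v in left_cats]
    right = [y for v, y in data if v not in left_cats]
    n = len(data)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

def greedy_categorical_split(data):
    """Greedy forward selection over feature values: O(n^2) entropy
    evaluations instead of enumerating all 2^n subsets.

    data: list of (feature_value, label) pairs.
    Returns (best subset of categories for the left side, its entropy).
    """
    cats = set(v for v, _ in data)
    left, best_left, best_score = set(), None, float("inf")
    # Grow the left side one category at a time; keep at least one
    # category on the right so the split stays non-trivial.
    while len(left) < len(cats) - 1:
        cand, cand_score = None, float("inf")
        for c in cats - left:
            s = weighted_split_entropy(data, left | {c})
            if s < cand_score:
                cand, cand_score = c, s
        left.add(cand)
        # Track the best split seen at any step, not just the final one.
        if cand_score < best_score:
            best_left, best_score = set(left), cand_score
    return best_left, best_score
```

For instance, with `data = [("a", 0), ("a", 0), ("b", 1), ("b", 1), ("c", 0), ("c", 0)]` the first greedy step already finds the pure split `{"b"}` vs `{"a", "c"}` with weighted entropy 0. Being greedy, it can of course miss the optimal subset on adversarial data, which is the trade-off discussed above.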