Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/886#issuecomment-44381347
I don't have a reference, and I did look for one. I am sure it is not
optimal, and not even that great as a greedy algorithm. Two low-entropy
distributions over target values could be high-entropy when combined.
You could pick the one feature value that makes the target's entropy lowest,
then pick the next value that makes the combined split's entropy lowest, and
so on. That amounts to evaluating on the order of n^2 subsets instead of 2^n.
If the alternative is to fail, or spend years in computation, I think
heuristics of some kind are a must. Even random selection of subsets is better
than rejecting the problem entirely -- anything is better than that I think.
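The greedy procedure above can be sketched roughly as follows (this is just an illustrative sketch, not the Spark implementation; the function names and the `(feature_value, label)` data layout are my own for the example). Each step adds the one remaining category that most lowers the weighted entropy of the resulting binary split, so we do at most n passes of at most n candidate evaluations each:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a multiset of target labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_split_entropy(data, left_cats):
    """Weighted entropy of the binary split sending left_cats to one side."""
    left = [y for v, y in data if v in left_cats]
    right = [y for v, y in data if v not in left_cats]
    n = len(data)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

def greedy_categorical_split(data):
    """Greedy forward selection over feature values: O(n^2) entropy
    evaluations instead of enumerating all 2^n subsets.

    data: list of (feature_value, label) pairs.
    Returns (best subset of categories for the left side, its entropy).
    """
    cats = set(v for v, _ in data)
    left, best_left, best_score = set(), None, float("inf")
    # Grow the left side one category at a time; keep at least one
    # category on the right so the split stays non-trivial.
    while len(left) < len(cats) - 1:
        cand, cand_score = None, float("inf")
        for c in cats - left:
            s = weighted_split_entropy(data, left | {c})
            if s < cand_score:
                cand, cand_score = c, s
        left.add(cand)
        # Track the best split seen at any step, not just the final one.
        if cand_score < best_score:
            best_left, best_score = set(left), cand_score
    return best_left, best_score
```

For instance, with `data = [("a", 0), ("a", 0), ("b", 1), ("b", 1), ("c", 0), ("c", 0)]` the first greedy step already finds the pure split `{"b"}` vs `{"a", "c"}` with weighted entropy 0. Being greedy, it can of course miss the optimal subset on adversarial data, which is the trade-off discussed above.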