[GitHub] spark pull request: [SPARK-5119] java.lang.ArrayIndexOutOfBoundsEx...

jkbradley Mon, 19 Jan 2015 13:56:19 -0800

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3975#issuecomment-70568225
  
    @Lewuathe  You're right; it is an API change for either algorithm.  I was 
arguing for an API change in data loading.  However, thinking more about it, I 
think you're right to make the API change happen in DecisionTree itself.  (I've 
written out my thoughts in more detail below.)  Here are some questions which 
come up though:
    * Do we only re-label -1,+1 to 0,1?  Or do we re-label any set of values to 
0,1,2,...?
    * Does predict() return the original labels or the re-labeled ones?
    * Should we make this standard across all algorithms (changing even more 
APIs)?
    
    CC: @mengxr  What do you think?
    
    My various thoughts:
    
    *In the spark.mllib API:*
    * If DecisionTree relabels -1 to 0, then we must do one of the following 
for test-time prediction (and neither option is great):
      * Prediction outputs original labels.
        * This would require saving a label dictionary/index mapping between 
original labels and new labels.
        * Con: Some methods for computing test error will no longer work (e.g., 
sum_i (prediction_i - label_i)^2).
      * Prediction outputs 0-based indexed labels.
        * Con: Users will get confused when they test on held-out data which 
has not been relabeled.
    * LibSVM does not impose a standard for labels, but it really should.
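
To make the trade-off concrete, here is a minimal sketch in plain Python (not Spark code) of re-indexing labels to 0,1,2,... and of what happens when 0-based predictions are compared against held-out labels that were never relabeled. The helper names `build_label_index` and `squared_error` are hypothetical and exist only for this illustration; the sorted-order index assignment is an assumption, not something the pull request specifies.

```python
def build_label_index(labels):
    """Map each distinct original label to a 0-based index.

    Assumption for this sketch: indices follow the sorted order of the labels.
    Returns (to_index, to_label), where to_label[i] inverts to_index.
    """
    distinct = sorted(set(labels))
    to_index = {label: i for i, label in enumerate(distinct)}
    return to_index, distinct


def squared_error(predictions, labels):
    """sum_i (prediction_i - label_i)^2"""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels))


# Held-out data in the original -1/+1 label space.
original = [-1.0, 1.0, 1.0, -1.0]

to_index, to_label = build_label_index(original)
reindexed = [to_index[y] for y in original]

# Pretend the classifier is perfect, so its internal (0-based) predictions
# equal the re-indexed labels.
pred_indexed = list(reindexed)

# Option 1: predict() maps back through the stored dictionary to original labels.
pred_original = [to_label[i] for i in pred_indexed]

# Option 2: predict() returns 0-based labels, and the user compares them with
# held-out labels that were never relabeled.
err_matched = squared_error(pred_original, original)
err_mismatched = squared_error(pred_indexed, original)

print(err_matched, err_mismatched)
```

Even with a perfect classifier, the mismatched comparison reports a nonzero error, which is the confusion described under the second option. The first option avoids this, but only by keeping `to_label` alongside the model.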
    
    *In the spark.ml API:*
    Re-indexing should happen in DecisionTree.  This will be easier to do since 
SchemaRDD metadata can store the label dictionary/index.



