Github user LIDIAgroup commented on the pull request:

    https://github.com/apache/spark/pull/216#issuecomment-38718686
  
    I'll make some changes that, imho, will improve the discretizer in some aspects:
    1. I'll change the accumulator from a `Map` to an `Array`. This means collecting all the distinct labels and mapping them to sequential `Int`s at the beginning, then reversing that mapping at the end. I wanted to avoid this step, but after reading @mengxr's concerns about the complexity of `MapAccumulator`, I think it's worth doing.
    2. I propose changing the interface to a `train(data, featureIndexes)` function, which computes the thresholds to apply to each feature and stores them, and a `discretize(data)` function, which applies those thresholds to the given data. The chance that the training data and the data to be discretized differ is fairly small, since the indexes of the features to discretize have to be the same, but this way is probably clearer.
    3. My intention in having `discretize` return an `RDD[_]` was that future discretizers could be applied to data other than `LabeledPoint`s (in fact, `LabeledPoint` is placed under `mllib.regression`, so it doesn't seem to be a standard). The reason is that, unlike in the present case, labels are not always needed to discretize.
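    To make point 1 concrete, here is a minimal, dependency-free sketch (not the actual PR code) of the label re-indexing: distinct labels are mapped to sequential `Int`s up front, so per-label counts can live in a fixed-size `Array` rather than a `Map`, and the mapping is reversed at the end. All names here are illustrative.

```scala
// Toy label sequence standing in for the RDD's labels.
val labels: Seq[Double] = Seq(3.0, 7.0, 3.0, 1.0)

// Forward mapping: each distinct label gets a sequential Int index.
val label2Int: Map[Double, Int] =
  labels.distinct.sorted.zipWithIndex.toMap

// Reverse mapping, used at the end to recover the original labels.
val int2Label: Array[Double] =
  label2Int.toSeq.sortBy(_._2).map(_._1).toArray

// Counts indexed by the sequential Int instead of keyed by the label,
// which is what lets the accumulator be a plain Array.
val counts = new Array[Long](label2Int.size)
labels.foreach(l => counts(label2Int(l)) += 1)
```

In the real discretizer the counting would happen inside an accumulator over the RDD; the point of the sketch is only the round trip between labels and array indexes.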
    
    I'm going to work on point 1 now, but I'd be pleased to have your opinion on the other points. If you have a different proposal for what the discretizer's interface should be, I'm willing to discuss it and try to stick to it.
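    As a discussion aid, the `train`/`discretize` split from point 2 could look roughly like this. This is a hypothetical sketch over plain collections rather than RDDs, and the single-median split is a toy stand-in for the real threshold-finding logic; none of these names come from the PR.

```scala
class Discretizer {
  // Thresholds per trained feature index; empty until train() is called.
  private var thresholds: Map[Int, Array[Double]] = Map.empty

  // Computes and stores thresholds for the requested features.
  // Toy logic: a single split at the median of each feature.
  def train(data: Seq[Array[Double]], featureIndexes: Seq[Int]): Unit = {
    thresholds = featureIndexes.map { i =>
      val values = data.map(_(i)).sorted
      i -> Array(values(values.length / 2))
    }.toMap
  }

  // Applies the stored thresholds: trained features become bin indexes,
  // untrained features pass through unchanged.
  def discretize(data: Seq[Array[Double]]): Seq[Array[Double]] =
    data.map(_.zipWithIndex.map { case (v, i) =>
      thresholds.get(i) match {
        case Some(ts) => ts.count(_ <= v).toDouble
        case None     => v
      }
    })
}
```

The separation lets callers train on one dataset and discretize another, even if, as noted above, that possibility is limited in practice by the fixed feature indexes.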
    
    PS: I'm very sorry about my mistakes concerning code conventions. I'll be more careful about that in the future.

