Github user LIDIAgroup commented on the pull request: https://github.com/apache/spark/pull/216#issuecomment-38718686

I'll make some changes that, imho, will improve the discretizer in a few respects:

1. I'll change the accumulator from a `Map` to an `Array`. This implies collecting all the distinct labels and mapping them to sequential `Int`s at the beginning, then reversing that mapping at the end. I wanted to avoid this step, but after reading @mengxr's concerns about the complexity of `MapAccumulator`, I think it's worth doing.

2. I propose changing the interface to a `train(data, featureIndexes)` function that computes the thresholds to be applied to each feature and stores them, and a `discretize(data)` function that applies those thresholds to the given data. The chance of the training data and the data to discretize being different is quite limited, since the indexes of the features to discretize have to be the same, but this way is probably clearer.

3. My intention with `discretize` returning an `RDD[_]` was that future discretizers could be applied to data other than `LabeledPoint`s (in fact, `LabeledPoint` is placed under `mllib.regression`, so it doesn't seem to be a standard). The reason is that, unlike in the present case, labels are not always needed to discretize.

I'm going to work on point 1 now, but I'd be pleased to have your opinion on the other points. If you have a different proposal for what the discretizer's interface should be, I'm willing to discuss it and try to stick to it.

PS: I'm very sorry about my mistakes concerning code conventions. I'll be more careful about that in the future.
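To make point 1 concrete, here is a minimal, hypothetical sketch of the label-indexing idea: collect the distinct labels once, map each to a sequential `Int`, accumulate counts in a fixed-size `Array` (cheap to merge, unlike a `Map` accumulator), and reverse the mapping at the end. The names (`LabelIndexingSketch`, `countPerLabel`) are illustrative and not from the actual PR; the real code would accumulate across RDD partitions.

```scala
// Illustrative sketch only: local Seq stands in for an RDD of labels.
object LabelIndexingSketch {
  def countPerLabel(labels: Seq[Double]): Map[Double, Long] = {
    // Build the label -> sequential index mapping once, up front.
    val distinct = labels.distinct.sorted
    val labelToIndex: Map[Double, Int] = distinct.zipWithIndex.toMap

    // Accumulate counts in a fixed-size Array instead of a Map.
    val counts = new Array[Long](distinct.length)
    labels.foreach { l => counts(labelToIndex(l)) += 1 }

    // Reverse the mapping at the end to recover the original labels.
    distinct.zipWithIndex.map { case (l, i) => l -> counts(i) }.toMap
  }
}
```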
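The two-step interface proposed in point 2 could look roughly like the following sketch. Everything here is an assumption for illustration: local `Seq[Array[Double]]` stands in for `RDD[LabeledPoint]`, and the midpoint split rule is a toy placeholder for the discretizer's actual threshold computation.

```scala
// Hypothetical interface sketch: train() computes and stores per-feature
// thresholds; discretize() applies them to data with the same feature layout.
class FeatureDiscretizerSketch {
  // featureIndex -> sorted split thresholds for that feature
  private var thresholds: Map[Int, Array[Double]] = Map.empty

  def train(data: Seq[Array[Double]], featureIndexes: Seq[Int]): this.type = {
    thresholds = featureIndexes.map { i =>
      // Toy rule for illustration: a single split at the feature's midpoint.
      val values = data.map(_(i))
      i -> Array((values.min + values.max) / 2)
    }.toMap
    this
  }

  def discretize(data: Seq[Array[Double]]): Seq[Array[Double]] =
    data.map { row =>
      row.zipWithIndex.map { case (v, i) =>
        thresholds.get(i) match {
          case Some(splits) => splits.count(v > _).toDouble // bucket index
          case None         => v                            // feature left as-is
        }
      }
    }
}
```

Storing the thresholds in the trained instance is what lets `discretize` be reused on new data, with the caveat noted above that the feature indexes must match.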