[
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430009#comment-15430009
]
Yanbo Liang commented on SPARK-17086:
-------------------------------------
We should not throw an exception in this case. If the number of distinct input
values is less than {{numBuckets}}, we should simply return an array of the
distinct elements as splits. However, we should not actually count the distinct
input elements, which is very expensive; instead, we can collapse adjacent
splits produced by {{approxQuantile}} that are equal.
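
A rough sketch of what collapsing adjacent equal splits could look like, in plain Scala with no Spark dependency. The object and method names ({{SplitsSketch}}, {{collapseAdjacentSplits}}, {{withInfiniteBounds}}) are hypothetical and only for illustration; this is not the actual patch. It assumes the candidate splits are already sorted, as quantiles for increasing probabilities would be.

```scala
object SplitsSketch {
  // Hypothetical helper: keep a value only if it differs from the previously
  // kept one. Assumes `sorted` is in non-decreasing order, so all duplicates
  // are adjacent and no distinct count over the data is needed.
  def collapseAdjacentSplits(sorted: Array[Double]): Array[Double] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[Double]
    for (s <- sorted) {
      if (out.isEmpty || s != out.last) out += s
    }
    out.toArray
  }

  // Hypothetical helper: add the open-ended outer bounds after collapsing.
  def withInfiniteBounds(inner: Array[Double]): Array[Double] =
    Array(Double.NegativeInfinity) ++
      collapseAdjacentSplits(inner) ++
      Array(Double.PositiveInfinity)
}
```

With the inner values from the reported error, {{SplitsSketch.withInfiniteBounds(Array(1.0, 1.0, 2.0, 2.0, 3.0, 3.0))}} would give {{[-Infinity, 1.0, 2.0, 3.0, Infinity]}}, that is, fewer buckets than the requested {{numBuckets}} but strictly increasing splits.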
> QuantileDiscretizer throws InvalidArgumentException (parameter splits given
> invalid value) on valid data
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-17086
> URL: https://issues.apache.org/jira/browse/SPARK-17086
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
> .setInputCol("intData")
> .setOutputCol("intData_bin")
> .setNumBuckets(10)
> .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0,
> 3.0, Infinity]
> {code}
> I don't think that duplicate splits should be generated, should they?