[ 
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426185#comment-15426185
 ] 

Sean Owen commented on SPARK-17086:
-----------------------------------

The issue is that approxQuantile may return quantiles where neighboring values 
are the same. That much is correct: it's perfectly possible for two different 
quantiles to have identical values. However, the result is used as input to 
the discretizer's splits parameter, which requires strictly increasing values. 
I think this requirement should just be relaxed. It's not invalid for two 
successive quantiles to be equal.
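
To make the point concrete, here is a minimal sketch in plain Scala (no Spark 
dependency) using the sample data from the report. It computes exact 
nearest-rank quantiles, which is only a stand-in for approxQuantile, and then 
checks the result against a strict and a relaxed ordering rule. The object and 
helper names are hypothetical and are not part of Spark.

```scala
object DuplicateQuantiles {
  // Exact nearest-rank quantile on an already sorted sample.
  // Assumption: this stands in for approxQuantile, which is approximate.
  def quantile(sorted: Array[Double], p: Double): Double = {
    val rank = math.ceil(p * sorted.length).toInt
    sorted(math.max(rank, 1) - 1)
  }

  // The strict rule: every value must be greater than its predecessor.
  def strictlyIncreasing(xs: Seq[Double]): Boolean =
    xs.zip(xs.tail).forall { case (a, b) => a < b }

  // The relaxed rule: equal neighbors are allowed.
  def nonDecreasing(xs: Seq[Double]): Boolean =
    xs.zip(xs.tail).forall { case (a, b) => a <= b }

  def main(args: Array[String]): Unit = {
    val data = Array(1, 3, 2, 1, 1, 2, 3, 2, 2, 2, 1, 3).map(_.toDouble).sorted
    val numBuckets = 10
    // numBuckets - 1 interior quantiles, padded with infinite outer bounds.
    val interior = (1 until numBuckets).map(i => quantile(data, i.toDouble / numBuckets))
    val splits = Double.NegativeInfinity +: interior :+ Double.PositiveInfinity
    println(splits.mkString("[", ", ", "]"))
    println(s"strictly increasing: ${strictlyIncreasing(splits)}")
    println(s"non-decreasing:      ${nonDecreasing(splits)}")
  }
}
```

With only three distinct values and ten requested buckets, repeated quantiles 
are unavoidable, so the strict check fails while the relaxed one passes.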

The bucket defined by [1.0, 1.0) will only receive the value 1.0, but that's 
fine. In a case like this, the bucket [1.0, 2.0) will actually never get any 
values. But that's fine too. It may not be a useful result, but it's correct 
given the request to create 10 buckets.
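
A rough sketch of how values could land in buckets when duplicate splits are 
allowed, assuming the lookup is a binary search over the splits array. That 
lookup strategy is an assumption made for illustration; bucketOf is a 
hypothetical helper, not a Spark method. Note that java.util.Arrays.binarySearch 
does not guarantee which of several equal elements it finds, so the exact 
bucket chosen for a value that equals a repeated split may vary.

```scala
import java.util.Arrays

object BucketLookup {
  // Hypothetical helper: index i of the bucket [splits(i), splits(i + 1))
  // chosen for x. Assumes splits is sorted and has infinite outer bounds.
  def bucketOf(splits: Array[Double], x: Double): Int = {
    val idx = Arrays.binarySearch(splits, x)
    if (idx >= 0) math.min(idx, splits.length - 2) // x equals a split value
    else -idx - 2                                  // insertion point minus one
  }

  def main(args: Array[String]): Unit = {
    // The splits array quoted in the error message of this issue.
    val splits = Array(Double.NegativeInfinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0,
      Double.PositiveInfinity)
    val data = Seq(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0, 1.0, 3.0)
    // Count how many values each bucket receives; unused buckets do not appear.
    val counts = data.groupBy(bucketOf(splits, _)).map { case (b, vs) => b -> vs.size }
    counts.toSeq.sortBy(_._1).foreach { case (b, n) =>
      println(s"bucket $b [${splits(b)}, ${splits(b + 1)}): $n values")
    }
  }
}
```

Integer-valued data only ever matches a split exactly, so buckets that lie 
strictly between two distinct split values stay empty for this input.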

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17086
>                 URL: https://issues.apache.org/jira/browse/SPARK-17086
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running Spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
>         .setInputCol("intData")
>         .setOutputCol("intData_bin")
>         .setNumBuckets(10)
>         .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think duplicate splits should be generated, should they?


