[ 
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426246#comment-15426246
 ] 

Vincent commented on SPARK-17086:
---------------------------------

[~yanboliang] Yes, that case was handled in Spark 1.6.2.
[~srowen] Currently it throws an IllegalArgumentException in this case:
java.lang.IllegalArgumentException: quantileDiscretizer_07696c9dca6c parameter splits given invalid value

What I'm doing now is adding a check before calling approxQuantile: if the 
number of distinct input values is less than numBuckets, we simply return an 
array of the distinct values as the splits. For cases where the number of 
distinct input values is greater than numBuckets, we still go through 
approxQuantile, as it works now, to generate the splits. 

For example, with the input data shown in this issue, we would output the splits: 
[-Infinity, 1.0, 2.0, 3.0, Infinity]
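
A minimal sketch of the check described above, in plain Scala with no Spark 
dependency. This is not the actual patch: SplitsSketch, candidateSplits, and 
the fallback parameter (which stands in for the existing approxQuantile path) 
are hypothetical names. Sorting the distinct values, and sending the case 
where the distinct count equals numBuckets to the fallback, are assumptions 
as well.

```scala
// Hypothetical helper illustrating the proposed check; not Spark's implementation.
object SplitsSketch {

  /** Returns candidate splits bounded by -Infinity and Infinity.
    *
    * @param values     input column values
    * @param numBuckets requested number of buckets
    * @param fallback   stand-in for the existing approxQuantile-based path;
    *                   only invoked when there are enough distinct values
    */
  def candidateSplits(values: Seq[Double], numBuckets: Int)(
      fallback: Seq[Double] => Seq[Double]): Array[Double] = {
    // Assumption: distinct values are sorted so the splits are increasing.
    val distinct = values.distinct.sorted
    val inner =
      if (distinct.length < numBuckets) distinct // too few distinct values
      else fallback(values)                      // unchanged quantile path
    (Seq(Double.NegativeInfinity) ++ inner ++ Seq(Double.PositiveInfinity)).toArray
  }

  def main(args: Array[String]): Unit = {
    // Data from the issue description, with numBuckets = 10.
    val data = Seq(1, 3, 2, 1, 1, 2, 3, 2, 2, 2, 1, 3).map(_.toDouble)
    val splits = candidateSplits(data, 10)(_ => sys.error("fallback not reached"))
    println(splits.mkString("[", ", ", "]"))
    // prints [-Infinity, 1.0, 2.0, 3.0, Infinity]
  }
}
```

With only three distinct values and numBuckets = 10, the fallback is never 
called, so no duplicate splits can be produced for this input.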

What do you think? [~yanboliang] [~srowen]

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17086
>                 URL: https://issues.apache.org/jira/browse/SPARK-17086
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
>         .setInputCol("intData")
>         .setOutputCol("intData_bin")
>         .setNumBuckets(10)
>         .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think that there should be duplicate splits generated, should there be?



