[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426549#comment-15426549 ]

Barry Becker commented on SPARK-17086:
--------------------------------------

I think I agree with the discussion. Here is a summary of the conclusions, just 
to check my understanding:
 - It's fine for approxQuantile to return duplicate splits. It should always 
return the requested number of quantiles, corresponding to the length of the 
probabilities array passed to it.
 - QuantileBucketizer, on the other hand, may return fewer than the number of 
buckets requested. It should not give an error when the number of buckets 
requested is greater than the number of distinct values. If the call to 
approxQuantile returns duplicate splits, just discard the duplicates when passing 
the splits to QBucketizer. This saves you from having to compute unique values 
first in order to check whether that number is less than the requested number 
of bins. I think it's fine that QBucketizer works this way. You want it to be 
robust and not give errors for edge cases like this. The objective is to return 
buckets with weights as close to equal as possible, using simple split values.
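
The dedup-before-bucketizing workaround could be sketched like this. This is only a rough sketch: it assumes a DataFrame {{df}} with a numeric "intData" column, and calls df.stat.approxQuantile and Bucketizer.setSplits directly rather than going through QuantileDiscretizer.

{code}
import org.apache.spark.ml.feature.Bucketizer

// Ask for 10 roughly equal-weight buckets via 11 probe probabilities.
val probabilities = (0 to 10).map(_ / 10.0).toArray
val rawSplits = df.stat.approxQuantile("intData", probabilities, 0.001)

// Drop duplicate quantiles and add the +/-Inf end points, so the
// Bucketizer never sees duplicate splits.
val splits =
  (Double.NegativeInfinity +: rawSplits :+ Double.PositiveInfinity).distinct

val bucketizer = new Bucketizer()
  .setInputCol("intData")
  .setOutputCol("intData_bin")
  .setSplits(splits)  // may define fewer buckets than requested
{code}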

If the data were [1,1,1,1,1,1,1,1,4,5,10] and I asked for 10 bins, then I 
would expect the splits to be [-Inf, 1, 4, 5, 10, Inf] even though the median 
is 1 and approxQuantile returned 1 repeated several times. If I asked for 2 bins, 
then I think the splits might be [-Inf, 1, 4, Inf]. If three bins are 
requested, would you get [-Inf, 1, 4, 5, Inf] or [-Inf, 1, 4, 10, Inf]? 
Maybe, in cases like this, you should get [-Inf, 1, 4, 5, 10, Inf] even though 
only 3 bins were requested. In other words, if there are only a small number of 
unique integer values in the data, and the number of bins is slightly less than 
that number, maybe it should be increased to match, since that is likely to 
be more meaningful. For now, just removing duplicates is probably enough.
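
For the sample data above, deduplicating works out like this (pure Scala, with the raw quantiles hand-written the way approxQuantile might plausibly return them; the exact duplicates depend on the probabilities and relative error used):

{code}
// Hypothetical raw splits for [1,1,1,1,1,1,1,1,4,5,10]: 1 is repeated
// because most of the requested quantiles fall on the median.
val raw = Array(Double.NegativeInfinity,
                1.0, 1.0, 1.0, 1.0, 4.0, 5.0, 10.0,
                Double.PositiveInfinity)
val splits = raw.distinct
// splits: Array(-Infinity, 1.0, 4.0, 5.0, 10.0, Infinity)
{code}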

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17086
>                 URL: https://issues.apache.org/jira/browse/SPARK-17086
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
>         .setInputCol("intData")
>         .setOutputCol("intData_bin")
>         .setNumBuckets(10)
>         .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think duplicate splits should be generated, should they?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
