[
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426549#comment-15426549
]
Barry Becker commented on SPARK-17086:
--------------------------------------
I think I agree with the discussion. Here is a summary of the conclusions, just
to check my understanding:
- It's fine for approxQuantile to return duplicate splits. It should always
return the requested number of quantiles, corresponding to the length of the
probabilities array passed to it.
- QuantileDiscretizer, on the other hand, may produce fewer than the requested
number of buckets. It should not raise an error when the number of distinct
values is fewer than the number of buckets requested. If the call to
approxQuantile returns duplicate splits, just discard the duplicates when
passing the splits to the underlying Bucketizer. That saves you from having to
compute the unique values first in order to check whether their count is less
than the requested number of bins. I think it's fine for the discretizer to
work this way: you want it to be robust and not raise errors for edge cases
like this. The objective is to return buckets that are as close to equal weight
as possible, with simple split values.
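The dedup step described above can be sketched in plain Scala (no Spark needed; `toBucketizerSplits` is a hypothetical helper name, not an actual Spark API):

```scala
// approxQuantile may legitimately return duplicate split candidates;
// drop the duplicates (and add the +/-Infinity endpoints) before the
// splits are handed to Bucketizer, which rejects non-increasing splits.
def toBucketizerSplits(quantiles: Array[Double]): Array[Double] =
  (Double.NegativeInfinity +: quantiles :+ Double.PositiveInfinity).distinct

// Duplicates like those in the reported error: 1.0, 1.0, 2.0, 2.0, 3.0, 3.0
val raw = Array(1.0, 1.0, 2.0, 2.0, 3.0, 3.0)
println(toBucketizerSplits(raw).mkString(", "))
// -Infinity, 1.0, 2.0, 3.0, Infinity
```

Since `distinct` preserves first-occurrence order and the quantiles come back sorted, the result stays strictly increasing.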
If the data were \[1,1,1,1,1,1,1,1,4,5,10\] and I asked for 10 bins, then I
would expect the splits to be \[-Inf, 1, 4, 5, 10, Inf\], even though the
median is 1 and approxQuantile returned 1 repeated several times. If I asked
for 2 bins, then I think the splits might be \[-Inf, 1, 4, Inf\]. If three bins
are requested, would you get \[-Inf, 1, 4, 5, Inf\] or \[-Inf, 1, 4, 10, Inf\]?
Maybe, in cases like this, you should get \[-Inf, 1, 4, 5, 10, Inf\] even
though only 3 bins were requested. In other words, if there are only a small
number of unique integer values in the data, and the number of bins requested
is slightly less than that number, maybe it should be increased to match, since
that is likely to be more meaningful. For now, just removing duplicates is
probably enough.
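To see the dedup behaviour on that data, here is a sketch using a naive exact quantile in place of approxQuantile (an assumption for illustration only; the real QuantileDiscretizer calls DataFrame.stat.approxQuantile and replaces the extreme quantiles with +/-Infinity, which is why the maximum value 10 lands inside the top bucket rather than appearing as a split):

```scala
// Naive exact quantile: the value at rank ceil(p * n) - 1 in the sorted data.
// With heavily repeated data, many probabilities map to the same value.
def quantiles(data: Array[Double], probs: Array[Double]): Array[Double] = {
  val sorted = data.sorted
  probs.map { p =>
    val idx = math.min(sorted.length - 1,
      math.max(0, math.ceil(p * sorted.length).toInt - 1))
    sorted(idx)
  }
}

val data = Array[Double](1, 1, 1, 1, 1, 1, 1, 1, 4, 5, 10)
// Interior split candidates for 10 requested buckets: p = 0.1 .. 0.9
val interior = quantiles(data, (1 to 9).map(_ / 10.0).toArray)
val splits = (Double.NegativeInfinity +: interior :+ Double.PositiveInfinity).distinct
println(splits.mkString(", "))
// -Infinity, 1.0, 4.0, 5.0, Infinity
```

So 10 requested buckets collapse to 4 after deduplication, which is the "fewer buckets than requested" outcome argued for above.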
> QuantileDiscretizer throws InvalidArgumentException (parameter splits given
> invalid value) on valid data
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-17086
> URL: https://issues.apache.org/jira/browse/SPARK-17086
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
> .setInputCol("intData")
> .setOutputCol("intData_bin")
> .setNumBuckets(10)
> .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0,
> 3.0, Infinity]
> {code}
> I don't think duplicate splits should be generated, should they?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)