[
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885315#comment-15885315
]
Nick Pentreath commented on SPARK-19714:
----------------------------------------
I also agree that the naming of {{splits}} could be better, but for now we're
stuck with it. We could deprecate it and have a new param, but to me the param
doc is pretty clear and unambiguous about what it actually does. So that option
seems more confusing to users than it's worth.
Of course {{QuantileDiscretizer}} is different but the result is exactly a
{{Bucketizer}} - the discretizer computes what the actual values of the splits
should be. My point is that if you want to include values outside of the splits
(bucket boundaries) you need to be explicit and put Inf/-Inf in {{splits}}.
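To make that concrete, here is a minimal sketch of the reporter's example with explicit infinite boundaries added to {{splits}}. It assumes a local {{SparkSession}}; the names {{explicitSplits}} and the output column {{bucket}} are only illustrative and do not come from the issue.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("bucketizer-explicit-bounds")
  .getOrCreate()

// Same input as in the issue: the doubles 0.0 to 499.0.
val contDF = spark.range(500).selectExpr("cast(id as double) as id")

// Explicit -Inf/+Inf boundaries so every finite value falls inside some bucket.
val explicitSplits = Array(
  Double.NegativeInfinity, 5.0, 10.0, 250.0, 500.0, Double.PositiveInfinity)

val bucketer = new Bucketizer()
  .setSplits(explicitSplits)
  .setInputCol("id")
  .setOutputCol("bucket") // illustrative column name
  .setHandleInvalid("skip")

val bucketed = bucketer.transform(contDF)
bucketed.show()
```

With these splits, values below 5.0 (such as the 0.0 from the error message) should land in the first bucket rather than raising the out-of-bounds exception.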
If you believe that the "invalid" handling should also cover values outside
of the split range, that can be discussed. Do you propose to put all values
outside the range in the special bucket (as is done for {{NaN}} now)?
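For reference, a small sketch of the existing {{NaN}} handling mentioned above, assuming a local {{SparkSession}} and {{handleInvalid}} set to "keep". The names {{withNaN}}, {{nanSplits}} and the output column {{bucket}} are illustrative only. As I understand the current behaviour, the {{NaN}} row is placed in one extra bucket after the regular ones.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("bucketizer-nan-keep")
  .getOrCreate()
import spark.implicits._

// A few in-range values plus one NaN (illustrative data, not from the issue).
val withNaN = Seq(1.0, 7.0, 300.0, Double.NaN).toDF("id")

// Five regular buckets, with explicit infinite outer boundaries.
val nanSplits = Array(
  Double.NegativeInfinity, 5.0, 10.0, 250.0, 500.0, Double.PositiveInfinity)

val keepBucketer = new Bucketizer()
  .setSplits(nanSplits)
  .setInputCol("id")
  .setOutputCol("bucket") // illustrative column name
  .setHandleInvalid("keep") // keep NaN rows instead of skipping or failing

val kept = keepBucketer.transform(withNaN)
kept.show()
```

Swapping "keep" for "skip" drops the {{NaN}} row instead, and "error" raises an exception on it.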
> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---------------------------------------------------
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.1.0
> Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
> .setSplits(splits)
> .setInputCol("id")
> .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of
> Bucketizer bounds [5.0, 500.0]. Check your features, or loosen the
> lower/upper bound constraints.
> {code}
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?