[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882216#comment-15882216 ]

Nick Pentreath edited comment on SPARK-19714 at 2/24/17 8:35 AM:
-----------------------------------------------------------------

I agree that the parameter naming is perhaps misleading. At least the doc 
should be updated because "invalid" here actually means {{NaN}} or {{null}}. 

However, {{Bucketizer}} is doing what you tell it to, since the splits are 
specified by you. Note that if you used {{QuantileDiscretizer}} to construct 
the {{Bucketizer}}, it adds {{+/- Infinity}} as the lower/upper bounds of the 
splits. You can do the same if you want anything below the lower bound or 
above the upper bound to be "valid"; you will then have two more buckets.
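For example, a minimal sketch padding the splits from the repro below (the output column name here is illustrative, not from the original report):

{code}
import org.apache.spark.ml.feature.Bucketizer

// Pad the user-specified splits with -Infinity/+Infinity so that values
// below 5.0 or at/above 500.0 land in their own buckets instead of erroring.
val splits = Array(Double.NegativeInfinity, 5.0, 10.0, 250.0, 500.0, Double.PositiveInfinity)

val bucketer = new Bucketizer()
  .setSplits(splits)
  .setInputCol("id")
  .setOutputCol("bucket")  // assumed output column name

// contDF as defined in the repro below
bucketer.transform(contDF).show()
{code}

With the padded splits, a value like {{0.0}} falls into the first bucket rather than triggering the "out of Bucketizer bounds" error.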


> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---------------------------------------------------
>
>                 Key: SPARK-19714
>                 URL: https://issues.apache.org/jira/browse/SPARK-19714
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: Bill Chambers
>
> {code}
> import org.apache.spark.ml.feature.Bucketizer
>
> // ids 0.0 through 4.0 fall below the lowest split, 5.0
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect this to handle the invalid (out-of-bounds) values. However, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that {{handleInvalid}} doesn't actually handle invalid inputs.
> Thoughts, anyone?


