Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/5980#issuecomment-100331836
It's hard to say without knowing the use cases. It's true that empty buckets might be a
problem. I'm OK with requiring explicit buckets for -inf, inf; let's just do
that, but we'll have to make sure we document it in the Scala doc + examples.
For values out of range, there are several behaviors the user might want:
* new bucket for all bad values
* closest bucket
* throw an error (to make user aware of bad data or assumptions)
We could eventually support all of these via a new parameter, but for now,
shall we pick one? I suppose I'd vote for putting out-of-range values in the
closest bucket (which will mean the first and last split values get replaced
with -inf, inf under the hood).
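To illustrate the "closest bucket" option, here is a minimal sketch (not Spark's actual Bucketizer implementation; `bucketIndex` and `BucketizerSketch` are hypothetical names) of how replacing the first and last split values with -inf, inf under the hood sends out-of-range values to the first or last bucket:

```scala
// Sketch only: assigns a value to a bucket defined by an ordered array of
// splits, after replacing the first and last splits with -inf / +inf so that
// out-of-range values land in the closest (first or last) bucket.
object BucketizerSketch {
  def bucketIndex(splits: Array[Double], x: Double): Int = {
    require(splits.length >= 2, "need at least two splits to form a bucket")
    val extended = splits.clone()
    extended(0) = Double.NegativeInfinity          // widen lowest bucket
    extended(extended.length - 1) = Double.PositiveInfinity // widen highest bucket
    // Find i such that extended(i) <= x < extended(i + 1).
    var i = 0
    while (i < extended.length - 2 && x >= extended(i + 1)) i += 1
    i
  }

  def main(args: Array[String]): Unit = {
    val splits = Array(0.0, 1.0, 2.0)        // two buckets: [0, 1) and [1, 2)
    println(bucketIndex(splits, -5.0))       // below range -> closest bucket 0
    println(bucketIndex(splits, 0.5))        // in range    -> bucket 0
    println(bucketIndex(splits, 10.0))       // above range -> closest bucket 1
  }
}
```

With this behavior the user never sees an error or a special "bad value" bucket; the trade-off is that silently clamping may hide bad data, which is why an eventual parameter covering all three behaviors could still be worthwhile.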