Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/6451#discussion_r31208098
--- Diff: docs/ml-features.md ---
@@ -789,6 +789,90 @@ scaledData = scalerModel.transform(dataFrame)
</div>
</div>
+## Bucketizer
+
+`Bucketizer` transforms a column of continuous features to a column of
feature buckets, where the buckets are specified by users. It takes a parameter:
+
+* `splits`: Parameter for mapping continuous features into buckets. With
n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in
the range [x,y) except the last bucket, which also includes y. Splits should be
strictly increasing. Values at -inf, inf must be explicitly provided to cover
all Double values; otherwise, values outside the splits specified will be
treated as errors. Two examples of `splits` are `Array(Double.NegativeInfinity,
0.0, 1.0, Double.PositiveInfinity)` and `Array(0.0, 1.0, 2.0)`.
+
+Note that if you do not know the upper and lower bounds of the
targeted column, you should add `Double.NegativeInfinity` and
`Double.PositiveInfinity` as the bounds of your splits to prevent a potential
out of Bucketizer bounds exception.
+
+Note also that the splits you provide must be in strictly
increasing order, i.e. `s0 < s1 < s2 < ... < sn`.
+
+More details can be found in the API docs for
[Bucketizer](api/scala/index.html#org.apache.spark.ml.feature.Bucketizer).
+
+The following example demonstrates how to bucketize a column of `Double`s
into another column of bucket indices.
+
+<div class="codetabs">
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.ml.feature.Bucketizer
+import org.apache.spark.sql.DataFrame
+
+// Since we know the bounds of data, there is no need to add -inf and inf.
--- End diff --
Could you actually have these examples use -inf and inf? I'm afraid some
people might copy the code without thinking and be confused when they get
out-of-bounds errors.
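
For illustration, here is a minimal sketch of how the Scala example might look with -inf and inf as the outer splits. The sample values, the column names, and the local `SparkSession` setup are assumptions made for this sketch, not taken from the PR. `SparkSession` needs Spark 2.0 or later, so an older shell would use its `sqlContext` instead.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("BucketizerSketch")
  .getOrCreate()
import spark.implicits._

// Infinite outer splits cover every Double, so no value can fall outside the buckets.
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)

// Hypothetical sample data, including values far outside [-0.5, 0.5].
val dataFrame = Seq(-999.9, -0.5, -0.3, 0.0, 0.2, 999.9).toDF("features")

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)

// Each value is mapped to the index of the bucket [x, y) that contains it.
val bucketedData = bucketizer.transform(dataFrame)
bucketedData.show()
```

With these splits, -999.9 lands in bucket 0 and 999.9 in bucket 3 instead of raising an out-of-bounds error, so the code stays safe to copy even when the range of the data is unknown.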