[GitHub] spark pull request: [SPARK-7577][ML][doc] add bucketizer doc

jkbradley Thu, 28 May 2015 00:13:02 -0700

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6451#discussion_r31208098
  
    --- Diff: docs/ml-features.md ---
    @@ -789,6 +789,90 @@ scaledData = scalerModel.transform(dataFrame)
     </div>
     </div>
     
    +## Bucketizer
    +
    +`Bucketizer` transforms a column of continuous features to a column of 
feature buckets, where the buckets are specified by users. It takes a parameter:
    +
    +* `splits`: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors. Two examples of `splits` are `Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity)` and `Array(0.0, 1.0, 2.0)`.
    +
    +Note that if you do not know the upper and lower bounds of the targeted column, you should add `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out-of-bounds exception from `Bucketizer`.
    +
    +Note also that the splits you provide must be in strictly increasing order, i.e. `s0 < s1 < s2 < ... < sn`.
    +
    +More details can be found in the API docs for 
[Bucketizer](api/scala/index.html#org.apache.spark.ml.feature.Bucketizer).
    +
    +The following example demonstrates how to bucketize a column of `Double`s into another column of bucket indices.
    +
    +<div class="codetabs">
    +<div data-lang="scala">
    +{% highlight scala %}
    +import org.apache.spark.ml.feature.Bucketizer
    +import org.apache.spark.sql.DataFrame
    +
    +// Since we know the bounds of data, there is no need to add -inf and inf.
    --- End diff --
    
    Could you actually have these examples use -inf and inf?  I'm afraid some 
people might copy the code without thinking and be confused when they get 
out-of-bounds errors.
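
To make the concern concrete, here is a minimal plain-Scala sketch of the bucket semantics described in the doc text above: half-open buckets `[x,y)`, a last bucket that also includes its upper bound, and an error for values outside the splits. `BucketSketch` and `bucketIndex` are hypothetical names chosen for illustration. This is not Spark's implementation and it does not use the `Bucketizer` API; it only mirrors the rules as the doc states them.

```scala
// Hypothetical helper that mirrors the documented bucket rules.
// It is an illustration only, not Spark's Bucketizer implementation.
object BucketSketch {

  /** Returns the index of the bucket holding `value`, or throws if it lies outside the splits. */
  def bucketIndex(splits: Array[Double], value: Double): Int = {
    require(splits.length >= 2, "at least two splits are needed to define a bucket")
    require(splits.sliding(2).forall(p => p(0) < p(1)), "splits must be strictly increasing")

    if (value == splits.last) {
      // The last bucket is closed on the right, so its upper bound belongs to it.
      splits.length - 2
    } else {
      // Index of the largest split that is <= value; -1 if value is below every split (or NaN).
      val idx = splits.lastIndexWhere(_ <= value)
      if (idx < 0 || idx >= splits.length - 1) {
        throw new IllegalArgumentException(s"value $value is outside the splits")
      }
      idx
    }
  }
}
```

Under these assumptions, `BucketSketch.bucketIndex(Array(0.0, 1.0, 2.0), 2.5)` throws, while the same value with `Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity)` lands in the last bucket. That is the failure mode a copied example without -inf and inf would expose.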



