[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

manishamde Wed, 20 Aug 2014 14:57:12 -0700

Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16508593
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -62,14 +69,15 @@ datasets `$D_{left}$` and `$D_{right}$` of sizes `$N_{left}$` and `$N_{right}$`,
     
     **Continuous features**
     
    -For small datasets in single machine implementations, the split candidates for each continuous
    +For small datasets in single-machine implementations, the split candidates for each continuous
     feature are typically the unique values for the feature. Some implementations sort the feature
     values and then use the ordered unique values as split candidates for faster tree calculations.
     
    -Finding ordered unique feature values is computationally intensive for large distributed
    -datasets. One can get an approximate set of split candidates by performing a quantile calculation
    -over a sampled fraction of the data. The ordered splits create "bins" and the maximum number of such
    -bins can be specified using the `maxBins` parameters.
    +Sorting feature values is expensive for large distributed datasets.
    --- End diff --
    
    Sounds good.




