GitHub user manishamde commented on a diff in the pull request:
https://github.com/apache/spark/pull/2063#discussion_r16508593
--- Diff: docs/mllib-decision-tree.md ---
@@ -62,14 +69,15 @@ datasets `$D_{left}$` and `$D_{right}$` of sizes `$N_{left}$` and `$N_{right}$`,
**Continuous features**
-For small datasets in single machine implementations, the split candidates for each continuous
+For small datasets in single-machine implementations, the split candidates for each continuous
feature are typically the unique values for the feature. Some implementations sort the feature
values and then use the ordered unique values as split candidates for faster tree calculations.
-Finding ordered unique feature values is computationally intensive for large distributed
-datasets. One can get an approximate set of split candidates by performing a quantile calculation
-over a sampled fraction of the data. The ordered splits create "bins" and the maximum number of such
-bins can be specified using the `maxBins` parameters.
+Sorting feature values is expensive for large distributed datasets.
--- End diff --
Sounds good.
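
For readers following the thread, the idea in the removed lines (quantiles over a sampled fraction of the data, capped by `maxBins`) can be sketched roughly as below. This is a single-machine illustration in plain Python, not MLlib's implementation; the function name `approx_split_candidates` and the `sample_fraction` and `seed` parameters are hypothetical and chosen only for this example.

```python
import random


def approx_split_candidates(values, max_bins, sample_fraction=0.1, seed=0):
    """Hypothetical helper: approximate split thresholds for one continuous feature."""
    if max_bins < 2:
        return []
    rng = random.Random(seed)
    # Sort only a random sample instead of the full feature column.
    sample = [v for v in values if rng.random() < sample_fraction]
    if not sample:
        return []
    sample.sort()
    n = len(sample)
    # max_bins bins need at most max_bins - 1 thresholds,
    # taken here at evenly spaced quantiles of the sample.
    thresholds = {sample[min(n - 1, (i * n) // max_bins)] for i in range(1, max_bins)}
    # A threshold equal to the sample maximum would leave one side empty.
    thresholds.discard(sample[-1])
    return sorted(thresholds)
```

The returned thresholds are ordered and deduplicated, so a feature with few distinct values yields fewer than `max_bins - 1` candidates. A distributed version would have to compute the sample and the quantiles across partitions, which this sketch does not attempt.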