Github user sethah commented on a diff in the pull request:
https://github.com/apache/spark/pull/7655#discussion_r35663441
--- Diff: docs/mllib-evaluation-metrics.md ---
@@ -0,0 +1,1475 @@
+---
+layout: global
+title: Evaluation Metrics - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Evaluation Metrics
+---
+
+* Table of contents
+{:toc}
+
+
+## Algorithm Metrics
+
+Spark's MLlib comes with a number of machine learning algorithms that can
be used to learn from and make predictions
+on data. When these algorithms are applied to build machine learning
models, the performance of each model must be evaluated
+against criteria that depend on the application and its
requirements. Spark's MLlib also provides a
+suite of metrics for the purpose of evaluating the performance of machine
learning models.
+
+Specific machine learning algorithms fall under broader types of machine
learning applications like classification,
+regression, clustering, etc. Each of these types has well-established
metrics for performance evaluation, and the
+metrics that are currently available in Spark's MLlib are detailed in this
section.
+
+## Classification Model Evaluation
+
+While there are many different types of classification algorithms, the
evaluation of classification models follows
+similar principles. In a [supervised classification
problem](https://en.wikipedia.org/wiki/Statistical_classification),
+there exists a true output and a model-generated predicted output for each
data point. For this reason, the results for
+each data point can be assigned to one of four categories:
+
+* True Positive (TP) - class predicted by model and class in true output
+* True Negative (TN) - class not predicted by model and class not in true
output
+* False Positive (FP) - class predicted by model and class not in true
output
+* False Negative (FN) - class not predicted by model and class in true
output
+
+These four numbers are the building blocks for most classifier evaluation
metrics. A fundamental point when considering
+classifier evaluation is that pure accuracy (i.e. was the prediction
correct or incorrect) is not generally a good metric,
+because a dataset may be highly unbalanced. For
example, if a model is designed to predict fraud from
+a dataset where 95% of the data points are _not fraud_ and 5% of the data
points are _fraud_, then a naive classifier
+that predicts _not fraud_, regardless of input, will be 95% accurate. For
this reason, metrics like
+[precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)
are typically used because they take into
+account the *type* of error. In most applications there is some desired
balance between precision and recall, which can
+be captured by combining the two into a single metric, called the
[F-measure](https://en.wikipedia.org/wiki/F1_score).
+
+### Binary Classification
+
+[Binary classifiers](https://en.wikipedia.org/wiki/Binary_classification)
are used to separate the elements of a given
+dataset into one of two possible groups (e.g. fraud or not fraud); binary
classification is a special case of multiclass classification.
+Most binary classification metrics can be generalized to multiclass
classification metrics.
+
+#### Threshold Tuning
+
+It is important to understand that, in most classification models, the
output of the model is actually a probability or
--- End diff --
I agree and I've reworded with your points in mind.
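
As an illustration of the four outcome counts and the 95%/5% fraud example in the quoted diff, here is a minimal sketch in plain Python. It does not use MLlib. The function names, the toy labels, and the second "better" classifier are assumptions made for this example only, and returning 0.0 when a denominator is zero is a convention chosen here, not something the diff specifies.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN, treating `positive` as the class of interest."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, precision, recall and F1 from the four counts.

    Assumption: a metric with a zero denominator is reported as 0.0.
    """
    tp, tn, fp, fn = (counts[k] for k in ("TP", "TN", "FP", "FN"))
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical data: 95 "not fraud" points (0) followed by 5 "fraud" points (1).
y_true = [0] * 95 + [1] * 5

# Naive classifier: always predicts "not fraud", regardless of input.
naive_pred = [0] * 100
naive = summarize(confusion_counts(y_true, naive_pred))

# Hypothetical alternative: 10 false alarms, but 4 of the 5 frauds are caught.
better_pred = [1] * 10 + [0] * 85 + [1, 1, 1, 1, 0]
better = summarize(confusion_counts(y_true, better_pred))

print("naive :", naive)
print("better:", better)
```

Under these assumptions the naive classifier reaches 0.95 accuracy with zero recall, while the hypothetical alternative has lower accuracy but non-zero precision, recall and F1, which is the point the diff makes about unbalanced data.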