Github user sethah commented on a diff in the pull request:
https://github.com/apache/spark/pull/7655#discussion_r35829175
--- Diff: docs/mllib-evaluation-metrics.md ---
@@ -0,0 +1,1476 @@
+---
+layout: global
+title: Evaluation Metrics - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Evaluation Metrics
+---
+
+* Table of contents
+{:toc}
+
+
+## Algorithm Metrics
+
+Spark's MLlib comes with a number of machine learning algorithms that can be used to learn from and make predictions
+on data. When these algorithms are applied to build machine learning models, the performance of each model needs to be
+evaluated against criteria that depend on the application and its requirements. Spark's MLlib also provides a
+suite of metrics for evaluating the performance of machine learning models.
+
+Specific machine learning algorithms fall under broader types of machine learning applications like classification,
+regression, clustering, etc. Each of these types has well-established metrics for performance evaluation, and the
+metrics that are currently available in Spark's MLlib are detailed in this section.
+
+## Classification Model Evaluation
+
+While there are many different types of classification algorithms, the evaluation of classification models is based on
+similar principles in every case. In a [supervised classification problem](https://en.wikipedia.org/wiki/Statistical_classification),
+there exists a true output and a model-generated predicted output for each data point. For this reason, the results for
+each data point can be assigned to one of four categories:
+
+* True Positive (TP) - class predicted by model and class in true output
+* True Negative (TN) - class not predicted by model and class not in true
output
+* False Positive (FP) - class predicted by model and class not in true
output
+* False Negative (FN) - class not predicted by model and class in true
output
+
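As a rough illustration of the four categories above, the sketch below tallies them from (prediction, label) pairs in plain Scala. The names `ConfusionCounts` and `ConfusionExample` are hypothetical helpers invented for this example and are not part of MLlib; the positive class is assumed to be encoded as `1.0` and the negative class as `0.0`.

```scala
// Hypothetical helper, not part of MLlib: holds the four outcome counts
// for a binary problem where 1.0 is the positive class.
final case class ConfusionCounts(tp: Long, tn: Long, fp: Long, fn: Long)

object ConfusionExample {
  // Each pair is (predicted label, true label).
  def count(predictionAndLabels: Seq[(Double, Double)]): ConfusionCounts =
    predictionAndLabels.foldLeft(ConfusionCounts(0L, 0L, 0L, 0L)) {
      case (c, (1.0, 1.0)) => c.copy(tp = c.tp + 1) // predicted, and in true output
      case (c, (0.0, 0.0)) => c.copy(tn = c.tn + 1) // not predicted, not in true output
      case (c, (1.0, 0.0)) => c.copy(fp = c.fp + 1) // predicted, but not in true output
      case (c, (0.0, 1.0)) => c.copy(fn = c.fn + 1) // not predicted, but in true output
      case (c, _)          => c                     // ignore labels other than 0.0 and 1.0
    }
}
```

For example, `ConfusionExample.count(Seq((1.0, 1.0), (1.0, 0.0)))` yields one true positive and one false positive.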
+These four numbers are the building blocks for most classifier evaluation
metrics. A fundamental point when considering
+classifier evaluation is that pure accuracy (i.e. was the prediction correct or incorrect) is not generally a good metric. The
+reason is that a dataset may be highly unbalanced. For example, if a model is designed to predict fraud from
+a dataset where 95% of the data points are _not fraud_ and 5% of the data
points are _fraud_, then a naive classifier
+that predicts _not fraud_, regardless of input, will be 95% accurate. For
this reason, metrics like
+[precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)
are typically used because they take into
+account the *type* of error. In most applications there is some desired
balance between precision and recall, which can
+be captured by combining the two into a single metric, called the
[F-measure](https://en.wikipedia.org/wiki/F1_score).
+
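The fraud example above can be worked through numerically. The sketch below assumes a hypothetical dataset of 100 points (5 fraud, 95 not fraud) and a naive classifier that always predicts _not fraud_; the object name `ImbalanceExample` is made up for illustration.

```scala
object ImbalanceExample {
  // Counts for a classifier that always predicts "not fraud" on 100 points,
  // of which 5 are fraud (positive) and 95 are not fraud (negative).
  val tp = 0.0
  val fp = 0.0
  val fn = 5.0
  val tn = 95.0

  // Accuracy counts every correct prediction, so the majority class dominates.
  val accuracy: Double = (tp + tn) / (tp + tn + fp + fn)

  // Recall = TP / (TP + FN): the share of actual fraud cases that were caught.
  val recall: Double = tp / (tp + fn)

  // Precision = TP / (TP + FP) is 0/0 here because nothing was predicted
  // positive, so it is modelled as None rather than as a number.
  val precision: Option[Double] =
    if (tp + fp == 0.0) None else Some(tp / (tp + fp))
}
```

Here accuracy comes out at 0.95 while recall is 0.0, which is the gap that precision and recall are meant to expose.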
+### Binary Classification
+
+[Binary classifiers](https://en.wikipedia.org/wiki/Binary_classification) are used to separate the elements of a given
+dataset into one of two possible groups (e.g. fraud or not fraud); binary classification is a special case of multiclass classification.
+Most binary classification metrics can be generalized to multiclass
classification metrics.
+
+#### Threshold Tuning
+
+It is important to understand that many classification models actually output a "score" (often a probability) for
+each class, where a higher score indicates higher likelihood. In the binary case, the model may output a probability for
+each class: $P(Y=1|X)$ and $P(Y=0|X)$. Instead of simply taking the higher probability, in some cases
+the model may need to be tuned so that it only predicts a class when the probability is very high (e.g. only block a
+credit card transaction if the model predicts fraud with >90% probability). Therefore, there is a prediction *threshold*
+which determines what the predicted class will be based on the probabilities that the model outputs.
+
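A minimal sketch of the thresholding rule described above, in plain Scala. The function `predict` and the object `ThresholdExample` are hypothetical names for this illustration; the strict `>` comparison is an assumption, and a given model may treat a score equal to the threshold differently.

```scala
object ThresholdExample {
  // Predict the positive class (1.0) only when P(Y=1|X) exceeds the threshold;
  // otherwise predict the negative class (0.0).
  def predict(positiveProbability: Double, threshold: Double): Double =
    if (positiveProbability > threshold) 1.0 else 0.0
}
```

With a score of 0.7, `ThresholdExample.predict(0.7, 0.5)` gives 1.0 but `ThresholdExample.predict(0.7, 0.9)` gives 0.0, so the same model output leads to a different decision. In MLlib itself, `LogisticRegressionModel` exposes `setThreshold` and `clearThreshold` for adjusting this behavior.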
+Tuning the prediction threshold will change the precision and recall of the model and is an important part of model
+optimization. In order to visualize how precision, recall, and other metrics change as a function of the threshold, it is
+common practice to plot competing metrics against one another, parameterized by threshold. A P-R curve plots (precision,
+recall) points for different threshold values, while a
+[receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), or ROC, curve
+plots (recall, false positive rate) points.
+
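To make the parameterization by threshold concrete, the sketch below computes precision, recall, and false positive rate for a list of thresholds in plain Scala. It is not how MLlib computes its curves; `CurveExample` and `curvePoints` are hypothetical names, the strict `>` comparison is an assumption, and the input is assumed to contain at least one positive and one negative label.

```scala
object CurveExample {
  // Each input pair is (score for the positive class, true label in {0.0, 1.0}).
  // Returns one (threshold, precision, recall, falsePositiveRate) tuple per threshold.
  def curvePoints(
      scoreAndLabels: Seq[(Double, Double)],
      thresholds: Seq[Double]): Seq[(Double, Double, Double, Double)] = {
    // Assumes both counts are non-zero, otherwise the rates are undefined.
    val positives = scoreAndLabels.count(_._2 == 1.0).toDouble
    val negatives = scoreAndLabels.size - positives
    thresholds.map { t =>
      val tp = scoreAndLabels.count { case (s, l) => s > t && l == 1.0 }.toDouble
      val fp = scoreAndLabels.count { case (s, l) => s > t && l == 0.0 }.toDouble
      // Convention assumed here: precision is 1.0 when nothing is predicted positive.
      val precision = if (tp + fp == 0.0) 1.0 else tp / (tp + fp)
      (t, precision, tp / positives, fp / negatives)
    }
  }
}
```

Plotting the second and third components of each tuple gives P-R points, while the third and fourth give ROC points.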
+**Available Metrics**
+
+<table class="table">
+ <thead>
+ <tr><th>Metric</th><th>Definition</th></tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>Precision (Positive Predictive Value)</td>
+ <td>$PPV=\frac{TP}{TP + FP}$</td>
+ </tr>
+ <tr>
+ <td>Recall (True Positive Rate)</td>
+ <td>$TPR=\frac{TP}{P}=\frac{TP}{TP + FN}$</td>
+ </tr>
+ <tr>
+ <td>F-measure</td>
+ <td>$F(\beta) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV \cdot
TPR}
+ {\beta^2 \cdot PPV + TPR}\right)$</td>
+ </tr>
+ <tr>
+ <td>Receiver Operating Characteristic (ROC)</td>
+ <td>$FPR(T)=\int^\infty_{T} P_0(t)\,dt \\ TPR(T)=\int^\infty_{T} P_1(t)\,dt$</td>
+ </tr>
+ <tr>
+ <td>Area Under ROC Curve</td>
+ <td>$AUROC=\int^1_{0} \frac{TP}{P} d\left(\frac{FP}{N}\right)$</td>
+ </tr>
+ <tr>
+ <td>Area Under Precision-Recall Curve</td>
+ <td>$AUPRC=\int^1_{0} \frac{TP}{TP+FP}
d\left(\frac{TP}{P}\right)$</td>
+ </tr>
+ </tbody>
+</table>
+
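The F-measure and area-under-curve definitions in the table can be sketched in plain Scala as follows. `MetricFormulas` is a hypothetical name; the area is approximated with the trapezoidal rule over a finite set of points, which is an assumption of this sketch rather than something the table specifies, and the points are assumed to be sorted by their x coordinate.

```scala
object MetricFormulas {
  // F(beta) = (1 + beta^2) * (PPV * TPR) / (beta^2 * PPV + TPR)
  // Returns 0.0 when the denominator is zero (a convention chosen for this sketch).
  def fMeasure(precision: Double, recall: Double, beta: Double = 1.0): Double = {
    val b2 = beta * beta
    val denom = b2 * precision + recall
    if (denom == 0.0) 0.0 else (1 + b2) * precision * recall / denom
  }

  // Trapezoidal approximation of the area under a curve given as (x, y) points.
  // The points must already be sorted by x; fewer than two points give 0.0.
  def areaUnderCurve(points: Seq[(Double, Double)]): Double =
    points.sliding(2).collect {
      case Seq((x1, y1), (x2, y2)) => (x2 - x1) * (y1 + y2) / 2.0
    }.sum
}
```

For ROC points, x is the false positive rate and y is the true positive rate; for the P-R curve, x is recall and y is precision.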
+
+**Examples**
+
+<div class="codetabs">
+The following code snippets illustrate how to load a sample dataset, train a binary classification algorithm on the
+data, and evaluate the performance of the algorithm using several binary evaluation metrics.
+
+<div data-lang="scala" markdown="1">
+
+{% highlight scala %}
+import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
+import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLUtils
+
+// Load training data in LIBSVM format
+val data = MLUtils.loadLibSVMFile(sc,
"data/mllib/sample_binary_classification_data.txt")
+
+// Split data into training (60%) and test (40%)
+val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
--- End diff --
Fixed this in Python and Scala.