GitHub user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7655#discussion_r35829175
  
    --- Diff: docs/mllib-evaluation-metrics.md ---
    @@ -0,0 +1,1476 @@
    +---
    +layout: global
    +title: Evaluation Metrics - MLlib
    +displayTitle: <a href="mllib-guide.html">MLlib</a> - Evaluation Metrics
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +
    +## Algorithm Metrics
    +
    +Spark's MLlib comes with a number of machine learning algorithms that can 
be used to learn from and make predictions
    +on data. When these algorithms are applied to build machine learning 
models, there is a need to evaluate the performance
    +of the model on some criteria, which depends on the application and its 
requirements. Spark's MLlib also provides a
    +suite of metrics for the purpose of evaluating the performance of machine 
learning models.
    +
    +Specific machine learning algorithms fall under broader types of machine 
learning applications like classification,
    +regression, clustering, etc. Each of these types has well-established 
metrics for performance evaluation, and those
    +metrics that are currently available in Spark's MLlib are detailed in this 
section.
    +
    +## Classification Model Evaluation
    +
    +While there are many different types of classification algorithms, the 
evaluation of all classification models shares
    +similar principles. In a [supervised classification 
problem](https://en.wikipedia.org/wiki/Statistical_classification),
    +there exists a true output and a model-generated predicted output for each 
data point. For this reason, the results for
    +each data point can be assigned to one of four categories:
    +
    +* True Positive (TP) - class predicted by model and class in true output
    +* True Negative (TN) - class not predicted by model and class not in true 
output
    +* False Positive (FP) - class predicted by model and class not in true 
output
    +* False Negative (FN) - class not predicted by model and class in true 
output
    +
    +These four numbers are the building blocks for most classifier evaluation 
metrics. A fundamental point when considering
    +classifier evaluation is that pure accuracy (i.e. was the prediction 
correct or incorrect) is not generally a good metric. The
    +reason is that a dataset may be highly unbalanced. For 
example, if a model is designed to predict fraud from
    +a dataset where 95% of the data points are _not fraud_ and 5% of the data 
points are _fraud_, then a naive classifier
    +that predicts _not fraud_, regardless of input, will be 95% accurate. For 
this reason, metrics like
    +[precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) 
are typically used because they take into
    +account the *type* of error. In most applications there is some desired 
balance between precision and recall, which can
    +be captured by combining the two into a single metric, called the 
[F-measure](https://en.wikipedia.org/wiki/F1_score).
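
As a rough illustration of the fraud example above, the following plain Scala sketch computes accuracy and recall for a classifier that always predicts _not fraud_. The dataset size of 1000 points and the names `Counts`, `accuracy`, `recall`, and `naive` are assumptions made for this sketch; they are not part of MLlib.

```scala
object NaiveClassifierExample {
  // Confusion counts, treating "fraud" as the positive class.
  final case class Counts(tp: Long, tn: Long, fp: Long, fn: Long)

  // Fraction of all predictions that were correct.
  def accuracy(c: Counts): Double =
    (c.tp + c.tn).toDouble / (c.tp + c.tn + c.fp + c.fn)

  // Recall is TP / (TP + FN); this sketch returns 0.0 when there are no positives.
  def recall(c: Counts): Double =
    if (c.tp + c.fn == 0) 0.0 else c.tp.toDouble / (c.tp + c.fn)

  // Assumed dataset: 950 "not fraud" and 50 "fraud" points.
  // The classifier never predicts fraud, so TP = 0 and FP = 0.
  val naive: Counts = Counts(tp = 0, tn = 950, fp = 0, fn = 50)
}
```

With these counts, `accuracy(naive)` is 0.95 while `recall(naive)` is 0.0, and precision is undefined because the classifier makes no positive predictions at all.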
    +
    +### Binary Classification
    +
    +[Binary classifiers](https://en.wikipedia.org/wiki/Binary_classification) 
are used to separate the elements of a given
    +dataset into one of two possible groups (e.g. fraud or not fraud). Binary 
classification is a special case of multiclass classification.
    +Most binary classification metrics can be generalized to multiclass 
classification metrics.
    +
    +#### Threshold Tuning
    +
    +It is important to understand that many classification models actually output 
a "score" (often a probability) for
    +each class, where a higher score indicates higher likelihood. In the 
binary case, the model may output a probability for
    +each class: $P(Y=1|X)$ and $P(Y=0|X)$. Instead of simply taking the higher 
probability, there may be some cases where
    +the model might need to be tuned so that it only predicts a class when the 
probability is very high (e.g. only block a
    +credit card transaction if the model predicts fraud with >90% 
probability). Therefore, there is a prediction *threshold*
    +which determines what the predicted class will be based on the 
probabilities that the model outputs.
    +
    +Tuning the prediction threshold will change the precision and recall of 
the model and is an important part of model
    +optimization. To visualize how precision, recall, and other 
metrics change as a function of the threshold, it is
    +common practice to plot competing metrics against one another, 
parameterized by threshold. A P-R curve plots (precision,
    +recall) points for different threshold values, while a [receiver operating 
characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic),
 or ROC,
    +curve plots (recall, false positive rate) points.
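
To make the effect of the threshold concrete, here is a minimal plain Scala sketch that turns scores into predictions at a given threshold and reports one (precision, recall, false positive rate) triple per threshold. The score values, the object name `ThresholdSketch`, and the helper `metricsAt` are made up for illustration, and the fallback values used when a denominator is zero are a convention chosen for this sketch only.

```scala
object ThresholdSketch {
  // (score for the positive class, true label) pairs; values are invented for illustration.
  val scoreAndLabels: Seq[(Double, Double)] = Seq(
    (0.95, 1.0), (0.80, 1.0), (0.60, 0.0), (0.40, 1.0), (0.10, 0.0))

  // Predict positive when score >= threshold, then return
  // (precision, recall, falsePositiveRate) for that threshold.
  def metricsAt(threshold: Double): (Double, Double, Double) = {
    val tp = scoreAndLabels.count { case (s, l) => s >= threshold && l == 1.0 }
    val fp = scoreAndLabels.count { case (s, l) => s >= threshold && l == 0.0 }
    val fn = scoreAndLabels.count { case (s, l) => s < threshold && l == 1.0 }
    val tn = scoreAndLabels.count { case (s, l) => s < threshold && l == 0.0 }
    // Assumed conventions for empty denominators: precision 1.0, recall 0.0, FPR 0.0.
    val precision = if (tp + fp == 0) 1.0 else tp.toDouble / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    val fpr = if (fp + tn == 0) 0.0 else fp.toDouble / (fp + tn)
    (precision, recall, fpr)
  }
}
```

On this toy data, raising the threshold from 0.5 to 0.9 moves precision from 2/3 to 1 and recall from 2/3 to 1/3, which is the trade-off that a P-R curve traces out. The (recall, false positive rate) pairs from the same sweep give the points of the ROC curve.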
    +
    +**Available Metrics**
    +
    +<table class="table">
    +  <thead>
    +    <tr><th>Metric</th><th>Definition</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Precision (Positive Predictive Value)</td>
    +      <td>$PPV=\frac{TP}{TP + FP}$</td>
    +    </tr>
    +    <tr>
    +      <td>Recall (True Positive Rate)</td>
    +      <td>$TPR=\frac{TP}{P}=\frac{TP}{TP + FN}$</td>
    +    </tr>
    +    <tr>
    +      <td>F-measure</td>
    +      <td>$F(\beta) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV \cdot 
TPR}
    +          {\beta^2 \cdot PPV + TPR}\right)$</td>
    +    </tr>
    +    <tr>
    +      <td>Receiver Operating Characteristic (ROC)</td>
    +      <td>$FPR(T)=\int^\infty_{T} P_0(t)\,dt \\ TPR(T)=\int^\infty_{T} 
P_1(t)\,dt$</td>
    +    </tr>
    +    <tr>
    +      <td>Area Under ROC Curve</td>
    +      <td>$AUROC=\int^1_{0} \frac{TP}{P} d\left(\frac{FP}{N}\right)$</td>
    +    </tr>
    +    <tr>
    +      <td>Area Under Precision-Recall Curve</td>
    +      <td>$AUPRC=\int^1_{0} \frac{TP}{TP+FP} 
d\left(\frac{TP}{P}\right)$</td>
    +    </tr>
    +  </tbody>
    +</table>
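
The first three definitions in the table can be written directly as functions of the confusion counts. This is a plain Scala sketch, not the MLlib implementation; the object and function names are hypothetical, and the code assumes the denominators are non-zero.

```scala
object FMeasureSketch {
  // PPV = TP / (TP + FP)
  def precision(tp: Long, fp: Long): Double = tp.toDouble / (tp + fp)

  // TPR = TP / (TP + FN)
  def recall(tp: Long, fn: Long): Double = tp.toDouble / (tp + fn)

  // F(beta) = (1 + beta^2) * PPV * TPR / (beta^2 * PPV + TPR)
  def fMeasure(ppv: Double, tpr: Double, beta: Double = 1.0): Double = {
    val b2 = beta * beta
    (1 + b2) * ppv * tpr / (b2 * ppv + tpr)
  }
}
```

For example, with assumed counts TP = 40, FP = 10, and FN = 20, precision is 0.8, recall is 2/3, and `fMeasure(0.8, 2.0 / 3)` is 8/11, or about 0.727. Values of `beta` above 1 weight recall more heavily, and values below 1 weight precision more heavily.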
    +
    +
    +**Examples**
    +
    +<div class="codetabs">
    +The following code snippets illustrate how to load a sample dataset, train 
a binary classification algorithm on the
    +data, and evaluate the performance of the algorithm by several binary 
evaluation metrics.
    +
    +<div data-lang="scala" markdown="1">
    +
    +{% highlight scala %}
    +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    +import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLUtils
    +
    +// Load training data in LIBSVM format
    +val data = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_binary_classification_data.txt")
    +
    +// Split data into training (60%) and test (40%)
    +val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    --- End diff --
    
    Fixed this in python and scala.

