Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7655#discussion_r35622323
  
    --- Diff: docs/mllib-evaluation-metrics.md ---
    @@ -0,0 +1,1475 @@
    +---
    +layout: global
    +title: Evaluation Metrics - MLlib
    +displayTitle: <a href="mllib-guide.html">MLlib</a> - Evaluation Metrics
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +
    +## Algorithm Metrics
    +
    +Spark's MLlib comes with a number of machine learning algorithms that can 
be used to learn from and make predictions
    +on data. When these algorithms are applied to build machine learning 
models, there is a need to evaluate the performance
    +of the model on some criteria, which depends on the application and its 
requirements. Spark's MLlib also provides a
    +suite of metrics for the purpose of evaluating the performance of machine 
learning models.
    +
    +Specific machine learning algorithms fall under broader types of machine 
learning applications like classification,
    +regression, clustering, etc. Each of these types has well-established 
metrics for performance evaluation, and those
    +metrics that are currently available in Spark's MLlib are detailed in this 
section.
    +
    +## Classification Model Evaluation
    +
    +While there are many different types of classification algorithms, the 
evaluation of classification models shares
    +similar principles. In a [supervised classification 
problem](https://en.wikipedia.org/wiki/Statistical_classification),
    +there exists a true output and a model-generated predicted output for each 
data point. For this reason, the results for
    +each data point can be assigned to one of four categories:
    +
    +* True Positive (TP) - class predicted by model and class in true output
    +* True Negative (TN) - class not predicted by model and class not in true 
output
    +* False Positive (FP) - class predicted by model and class not in true 
output
    +* False Negative (FN) - class not predicted by model and class in true 
output
    +
    +These four numbers are the building blocks for most classifier evaluation 
metrics. A fundamental point when considering
    +classifier evaluation is that pure accuracy (i.e. the fraction of 
predictions that are correct) is not generally a good metric. The
    +reason is that a dataset may be highly unbalanced. For 
example, if a model is designed to predict fraud from
    +a dataset where 95% of the data points are _not fraud_ and 5% of the data 
points are _fraud_, then a naive classifier
    +that predicts _not fraud_, regardless of input, will be 95% accurate. For 
this reason, metrics like
    +[precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) 
are typically used because they take into
    +account the *type* of error. In most applications there is some desired 
balance between precision and recall, which can
    +be captured by combining the two into a single metric, called the 
[F-measure](https://en.wikipedia.org/wiki/F1_score).
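    +
    +To make the fraud example concrete, assume (for illustration only) a dataset of exactly 100 data points, with
    +_fraud_ treated as the positive class. The naive classifier that always predicts _not fraud_ then yields
    +$TP = 0$, $FP = 0$, $FN = 5$ and $TN = 95$, so that:
    +
    +* accuracy: $\frac{TP + TN}{TP + TN + FP + FN} = \frac{95}{100} = 0.95$
    +* recall: $\frac{TP}{TP + FN} = \frac{0}{5} = 0$
    +* precision: $\frac{TP}{TP + FP} = \frac{0}{0}$, which is undefined
    +
    +The accuracy looks strong even though not a single fraudulent data point is detected.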
    +
    +### Binary Classification
    +
    +[Binary classifiers](https://en.wikipedia.org/wiki/Binary_classification) 
are used to separate the elements of a given
    +dataset into one of two possible groups (e.g. fraud or not fraud); binary 
classification is a special case of multiclass classification.
    +Most binary classification metrics can be generalized to multiclass 
classification metrics.
    +
    +#### Threshold Tuning
    +
    +It is important to understand that, in most classification models, the 
output of the model is actually a probability or
    +set of probabilities, which are then converted to predictions. In the 
binary case, the model outputs a probability for
    +each class: $P(Y=1|X)$ and $P(Y=0|X)$. However, in some cases 
the model may need to be tuned so that
    +it only predicts a class when the probability is very high (e.g. only 
block a credit card transaction if the model
    +predicts fraud with >90% probability). Therefore, there is a prediction 
*threshold* which determines what the predicted
    +class will be based on the probabilities that the model outputs.
    +
    +Tuning the prediction threshold will change the precision and
    +recall of the model and is an important part of model optimization. To 
visualize how precision, recall,
    +and other metrics change as a function of the threshold, it is common 
practice to plot competing metrics against one
    +another, parameterized by threshold. A P-R curve plots (precision, recall) 
points for different threshold values,
    +while a [receiver operating 
characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic),
 or ROC,
    +curve plots (recall, false positive rate) points.
    +
    +**Available Metrics**
    +
    +<table class="table">
    +  <thead>
    +    <tr><th>Metric</th><th>Definition</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Precision (Positive Predictive Value)</td>
    +      <td>$PPV=\frac{TP}{TP + FP}$</td>
    +    </tr>
    +    <tr>
    +      <td>Recall (True Positive Rate)</td>
    +      <td>$TPR=\frac{TP}{P}=\frac{TP}{TP + FN}$</td>
    +    </tr>
    +    <tr>
    +      <td>F-measure</td>
    +      <td>$F(\beta) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV \cdot 
TPR}
    +          {\beta^2 \cdot PPV + TPR}\right)$</td>
    +    </tr>
    +    <tr>
    +      <td>Receiver Operating Characteristic (ROC)</td>
    +      <td>$FPR(T)=\int^\infty_{T} P_0(t)\,dt \\ TPR(T)=\int^\infty_{T} 
P_1(t)\,dt$</td>
    +    </tr>
    +    <tr>
    +      <td>Area Under ROC Curve</td>
    +      <td>$AUROC=\int^1_{0} \frac{TP}{P} d\left(\frac{FP}{N}\right)$</td>
    +    </tr>
    +    <tr>
    +      <td>Area Under Precision-Recall Curve</td>
    +      <td>$AUPRC=\int^1_{0} \frac{TP}{TP+FP} 
d\left(\frac{TP}{P}\right)$</td>
    +    </tr>
    +  </tbody>
    +</table>
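    +
    +As a quick sanity check of the F-measure definition, setting $\beta = 1$ gives the harmonic mean of precision and
    +recall, $F(1) = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR}$. Values of $\beta$ above 1 weight recall more heavily and
    +values below 1 weight precision more heavily. For instance, with the purely illustrative values $PPV = 0.5$ and
    +$TPR = 1.0$:
    +
    +* $F(0.5) = \frac{1.25 \cdot 0.5}{0.25 \cdot 0.5 + 1.0} \approx 0.556$
    +* $F(1) = \frac{2 \cdot 0.5}{0.5 + 1.0} \approx 0.667$
    +* $F(2) = \frac{5 \cdot 0.5}{4 \cdot 0.5 + 1.0} \approx 0.833$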
    +
    +
    +**Examples**
    +
    +<div class="codetabs">
    +The following code snippets illustrate how to load a sample dataset, train 
a binary classification algorithm on the
    +data, and evaluate the performance of the algorithm by several binary 
evaluation metrics.
    +
    +<div data-lang="scala" markdown="1">
    +
    +{% highlight scala %}
    +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    +import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLUtils
    +
    +// Load training data in LIBSVM format
    +val data = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_binary_classification_data.txt")
    +
    +// Split data into training (60%) and test (40%)
    +val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    +val training = splits(0).cache()
    +val test = splits(1)
    +
    +// Run training algorithm to build the model
    +val model = new LogisticRegressionWithLBFGS()
    +  .setNumClasses(2)
    +  .run(training)
    +
    +// Clear the prediction threshold so the model will return probabilities
    +model.clearThreshold
    +
    +// Compute raw scores on the test set
    +val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
    +  val prediction = model.predict(features)
    +  (prediction, label)
    +}
    +
    +// Instantiate metrics object
    +val metrics = new BinaryClassificationMetrics(predictionAndLabels)
    +
    +// Precision by threshold
    +val precision = metrics.precisionByThreshold
    +precision.foreach(x => printf("Threshold: %1.2f, Precision: %1.2f\n", 
x._1, x._2))
    +
    +// Recall by threshold
    +val recall = metrics.recallByThreshold
    +recall.foreach(x => printf("Threshold: %1.2f, Recall: %1.2f\n", x._1, 
x._2))
    +
    +// Precision-Recall Curve
    +val PRC = metrics.pr
    +
    +// F-measure
    +val f1Score = metrics.fMeasureByThreshold
    +f1Score.foreach(x => printf("Threshold: %1.2f, F-score: %1.2f, Beta = 
1\n", x._1, x._2))
    +
    +val beta = 0.5
    +val fScore = metrics.fMeasureByThreshold(beta)
    +fScore.foreach(x => printf("Threshold: %1.2f, F-score: %1.2f, Beta = 
0.5\n", x._1, x._2))
    +
    +// AUPRC
    +val auPRC = metrics.areaUnderPR
    +println("Area under precision-recall curve = " + auPRC)
    +
    +// Compute thresholds used in ROC and PR curves
    +val thresholds = precision.map(_._1)
    +
    +// ROC Curve
    +val roc = metrics.roc
    +
    +// AUROC
    +val auROC = metrics.areaUnderROC
    +println("Area under ROC = " + auROC)
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +{% highlight java %}
    +import scala.Tuple2;
    +
    +import org.apache.spark.api.java.*;
    +import org.apache.spark.rdd.RDD;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.mllib.classification.LogisticRegressionModel;
    +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
    +import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics;
    +import org.apache.spark.mllib.regression.LabeledPoint;
    +import org.apache.spark.mllib.util.MLUtils;
    +import org.apache.spark.SparkConf;
    +import org.apache.spark.SparkContext;
    +
    +public class BinaryClassification {
    +  public static void main(String[] args) {
    +    SparkConf conf = new SparkConf().setAppName("Binary Classification 
Metrics");
    +    SparkContext sc = new SparkContext(conf);
    +    String path = "data/mllib/sample_binary_classification_data.txt";
    +    JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, 
path).toJavaRDD();
    +
    +    // Split initial RDD into two... [60% training data, 40% testing data].
    +    JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[] {0.6, 
0.4}, 11L);
    +    JavaRDD<LabeledPoint> training = splits[0].cache();
    +    JavaRDD<LabeledPoint> test = splits[1];
    +
    +    // Run training algorithm to build the model.
    +    final LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
    +      .setNumClasses(2)
    +      .run(training.rdd());
    +
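    +    // Clear the prediction threshold so the model will return probabilities.
    +    model.clearThreshold();
    +
    +    // Compute raw scores on the test set.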
    +    JavaRDD<Tuple2<Object, Object>> predictionAndLabels = test.map(
    +      new Function<LabeledPoint, Tuple2<Object, Object>>() {
    +        public Tuple2<Object, Object> call(LabeledPoint p) {
    +          Double prediction = model.predict(p.features());
    +          return new Tuple2<Object, Object>(prediction, p.label());
    +        }
    +      }
    +    );
    +
    +    // Get evaluation metrics.
    +    BinaryClassificationMetrics metrics = new 
BinaryClassificationMetrics(predictionAndLabels.rdd());
    +
    +    // Precision by threshold
    +    JavaRDD<Tuple2<Object, Object>> precision = 
metrics.precisionByThreshold().toJavaRDD();
    +    System.out.println("Precision by threshold: " + precision.toArray());
    +
    +    // Recall by threshold
    +    JavaRDD<Tuple2<Object, Object>> recall = 
metrics.recallByThreshold().toJavaRDD();
    +    System.out.println("Recall by threshold: " + recall.toArray());
    +
    +    // F Score by threshold
    +    JavaRDD<Tuple2<Object, Object>> f1Score = 
metrics.fMeasureByThreshold().toJavaRDD();
    +    System.out.println("F1 Score by threshold: " + f1Score.toArray());
    +
    +    JavaRDD<Tuple2<Object, Object>> f2Score = 
metrics.fMeasureByThreshold(2.0).toJavaRDD();
    +    System.out.println("F2 Score by threshold: " + f2Score.toArray());
    +
    +    // Precision-recall curve
    +    JavaRDD<Tuple2<Object, Object>> prc = metrics.pr().toJavaRDD();
    +    System.out.println("Precision-recall curve: " + prc.toArray());
    +
    +    // Thresholds
    +    JavaRDD<Double> thresholds = precision.map(
    +      new Function<Tuple2<Object, Object>, Double>() {
    +        public Double call (Tuple2<Object, Object> t) {
    +          return new Double(t._1().toString());
    +        }
    +      }
    +    );
    +
    +    // ROC Curve
    +    JavaRDD<Tuple2<Object, Object>> roc = metrics.roc().toJavaRDD();
    +    System.out.println("ROC curve: " + roc.toArray());
    +
    +    // AUPRC
    +    System.out.println("Area under precision-recall curve = " + 
metrics.areaUnderPR());
    +
    +    // AUROC
    +    System.out.println("Area under ROC = " + metrics.areaUnderROC());
    +
    +    // Save and load model
    +    model.save(sc, "myModelPath");
    +    LogisticRegressionModel sameModel = LogisticRegressionModel.load(sc, 
"myModelPath");
    +  }
    +}
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +{% highlight python %}
    +from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    +from pyspark.mllib.evaluation import BinaryClassificationMetrics
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.util import MLUtils
    +
    +# Several of the methods available in Scala are currently missing from 
PySpark
    +
    +# Load training data in LIBSVM format
    +data = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_binary_classification_data.txt")
    +
    +# Split data into training (60%) and test (40%)
    +splits = data.randomSplit([0.6, 0.4], seed = 11L)
    +training = splits[0].cache()
    +test = splits[1]
    +
    +# Run training algorithm to build the model
    +model = LogisticRegressionWithLBFGS.train(training)
    +
    +# Compute raw scores on the test set
    +predictionAndLabels = test.map(lambda lp: 
(float(model.predict(lp.features)), lp.label))
    +
    +# Instantiate metrics object
    +metrics = BinaryClassificationMetrics(predictionAndLabels)
    +
    +# Area under precision-recall curve
    +print "Area under PR = %1.2f" % metrics.areaUnderPR
    +
    +# Area under ROC curve
    +print "Area under ROC = %1.2f" % metrics.areaUnderROC
    +
    +{% endhighlight %}
    +
    +</div>
    +</div>
    +
    +
    +### Multiclass Classification
    +
    +[Multiclass 
classification](https://en.wikipedia.org/wiki/Multiclass_classification) 
describes a classification
    +problem where there are $M \gt 2$ possible labels for each data point (the 
case where $M=2$ is the binary
    +classification problem). For example, classifying handwriting samples as 
the digits 0 to 9 is a multiclass problem with 10 possible classes.
    +
    +#### Label based metrics
    +
    +Unlike binary classification, where there are only two possible labels, 
multiclass classification problems have many
    +possible labels, and so the concept of label-based metrics is introduced. 
Overall precision measures precision across all
    +labels - the number of times any class was predicted correctly (true 
positives) normalized by the number of data
    --- End diff --
    
    Is this really called "precision" and not just "accuracy"? Fine if it is; 
it was just unfamiliar to me. I understand one-label precision, yes.
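
For what it's worth, here is a minimal sketch of why the two numbers coincide, assuming each data point gets exactly one predicted label. It is plain Python with hypothetical helper names (`overall_precision`, `accuracy`, `label_precision`) and made-up sample pairs, not the MLlib API. Under that assumption every wrong prediction is a false positive for the predicted label, so "overall precision" pooled across labels reduces to the fraction of correct predictions, while per-label precision can still differ.

```python
from collections import Counter


def overall_precision(pairs):
    """Pooled (micro-averaged) precision over all labels.

    `pairs` is a list of (predicted, actual) label tuples; this layout is an
    assumption made for the sketch.
    """
    true_positives = Counter()
    false_positives = Counter()
    for predicted, actual in pairs:
        if predicted == actual:
            true_positives[predicted] += 1
        else:
            # A wrong prediction is a false positive for the predicted label.
            false_positives[predicted] += 1
    tp = sum(true_positives.values())
    fp = sum(false_positives.values())
    return tp / float(tp + fp)


def accuracy(pairs):
    """Fraction of data points whose predicted label equals the true label."""
    correct = sum(1 for predicted, actual in pairs if predicted == actual)
    return correct / float(len(pairs))


def label_precision(pairs, label):
    """Precision for one label; assumes the label was predicted at least once."""
    actual_for_label = [actual for predicted, actual in pairs if predicted == label]
    hits = sum(1 for actual in actual_for_label if actual == label)
    return hits / float(len(actual_for_label))


# Made-up (predicted, actual) pairs for a three-class problem.
pairs = [(0, 0), (1, 1), (2, 1), (2, 2), (0, 2), (1, 1)]
```

With these pairs, `overall_precision(pairs)` and `accuracy(pairs)` both evaluate to 4/6, whereas `label_precision(pairs, 2)` is 0.5. So the overall figure is numerically the same as accuracy in the single-label case.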


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
