GitHub user takuti opened a pull request:
https://github.com/apache/incubator-hivemall/pull/52
[HIVEMALL-78] Implement AUC UDAF for binary classification
## What changes were proposed in this pull request?
In addition to the existing `auc(array, array)` for ranking (myui/hivemall#326),
this patch adds support for `auc(double, double)` for binary classification.
## What type of PR is it?
Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-78
## How was this patch tested?
Added a unit test for the UDAF, which passes:
```
$ mvn -Dtest=hivemall.evaluation.AUCUDAFTest test
```
I also ran manual tests with the following queries:
```sql
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
), data_ordered as (
  select prob, label
  from data
  order by prob desc
)
select auc(prob, label)
from (
  select prob, label
  from data_ordered
  distribute by floor(prob / 0.2)
) t;
```
```sql
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
), data_ordered as (
  select prob, label
  from data
  order by prob desc
)
select auc(prob, label)
from data_ordered;
```
Both queries returned `AUC=0.83333`, which matches the result of
[scikit-learn's roc_auc_score()](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html):
```
>>> from sklearn.metrics import roc_auc_score
>>> roc_auc_score([0,1,0,1,1],[0.5,0.3,0.2,0.8,0.7])
0.83333333333333326
```
## How to use this feature?
See the queries above. Input data must be sorted by score in descending
order.
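To illustrate why the ordering matters, here is a minimal Python sketch of a single-pass AUC computation over rows already sorted by score in descending order. It is not the code in this patch: the function name `auc_sorted` and the trapezoid-accumulation approach are assumptions chosen for illustration, and the UDAF may handle ties and partial aggregation differently.

```python
def auc_sorted(rows):
    """Compute ROC AUC from (score, label) pairs sorted by score, descending.

    Hypothetical helper for illustration only; labels are assumed to be 0 or 1.
    """
    tp = fp = 0                # running true/false positive counts
    tp_prev = fp_prev = 0      # counts at the previous distinct score
    area = 0.0
    prev_score = None
    for score, label in rows:
        if score != prev_score:
            # Close the trapezoid spanned since the previous distinct score.
            area += (fp - fp_prev) * (tp + tp_prev) / 2.0
            tp_prev, fp_prev = tp, fp
            prev_score = score
        if label == 1:
            tp += 1
        else:
            fp += 1
    # Close the final trapezoid.
    area += (fp - fp_prev) * (tp + tp_prev) / 2.0
    if tp == 0 or fp == 0:
        raise ValueError("AUC is undefined without both classes present")
    # Normalize the unscaled area by (#positives * #negatives).
    return area / (tp * fp)


# Same five rows as in the queries above, sorted by prob descending.
rows = [(0.8, 1), (0.7, 1), (0.5, 0), (0.3, 1), (0.2, 0)]
print(auc_sorted(rows))
```

For the five sample rows this comes out to 5/6, consistent with the `0.83333` reported above. Because the running counts depend on the order in which rows arrive, feeding unsorted rows to a pass like this would give a different, incorrect area.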
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/takuti/incubator-hivemall auc
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-hivemall/pull/52.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #52
----
commit e60ff231e07aa515666ec7f4863ed1c8401e0e27
Author: Takuya Kitazawa <[email protected]>
Date: 2017-02-28T06:08:33Z
Implement AUCUDAF
commit 4756f463700740af0bd51ab7a25e383649a2d504
Author: Takuya Kitazawa <[email protected]>
Date: 2017-02-28T06:09:18Z
Add unit test of AUCUDAF for classification
----