Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/79

I tested the generic classifier and regressor on EMR using the a9a dataset.

### Classifier

```
set hivevar:n_samples=16281;
set hivevar:total_steps=32562;
```

#### `logress`

```sql
drop table if exists logress_model;
create table logress_model as
select
  feature,
  avg(weight) as weight
from (
  select
    logress(add_bias(features), label, '-total_steps ${total_steps}') as (feature, weight)
    -- logress(add_bias(features), label, '-total_steps ${total_steps} -mini_batch 10') as (feature, weight)
  from
    train_x3
) t
group by feature;
```

```sql
WITH test_exploded as (
  select
    rowid,
    label,
    extract_feature(feature) as feature,
    extract_weight(feature) as value
  from
    test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
  select
    t.rowid,
    sigmoid(sum(m.weight * t.value)) as prob,
    CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label
  from
    test_exploded t
    LEFT OUTER JOIN logress_model m ON (t.feature = m.feature)
  group by
    t.rowid
),
submit as (
  select
    t.label as actual,
    pd.label as predicted,
    pd.prob as probability
  from
    test t
    JOIN predict pd on (t.rowid = pd.rowid)
)
select count(1) / ${n_samples} from submit where actual = predicted;
```

#### `train_classifier`

```sql
train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}') as (feature, weight)
-- train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps} -mini_batch 10') as (feature, weight)
```

The results were exactly the same:

| | online | mini-batch |
|:--|:--:|:--:|
|`logress`| 0.8414716540753026 | 0.848965051286776 |
|`train_classifier`| 0.8414716540753026 | 0.848965051286776 |

### Regression

I solved the a9a label prediction as a regression problem. Note that, since the non-generic AdaGrad implementation was designed for logistic loss (i.e., classification), we cannot compare it with the generic regressor under exactly the same conditions.

#### `train_adagrad_regr` (internally uses logistic loss)

```sql
drop table if exists adagrad_model;
create table adagrad_model as
select
  feature,
  avg(weight) as weight
from (
  select
    train_adagrad_regr(features, label) as (feature, weight)
  from
    train_x3
) t
group by feature;
```

```sql
WITH test_exploded as (
  select
    rowid,
    label,
    extract_feature(feature) as feature,
    extract_weight(feature) as value
  from
    test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
  select
    t.rowid,
    sigmoid(sum(m.weight * t.value)) as prob
  from
    test_exploded t
    LEFT OUTER JOIN adagrad_model m ON (t.feature = m.feature)
  group by
    t.rowid
),
submit as (
  select
    t.label as actual,
    pd.prob as probability
  from
    test t
    JOIN predict pd on (t.rowid = pd.rowid)
)
select rmse(probability, actual) from submit;
```

#### `train_regression`

```sql
train_regression(features, label, '-loss squaredloss -opt AdaGrad -reg no') as (feature, weight)
-- train_regression(features, label, '-loss squaredloss -opt AdaGrad -reg no -mini_batch 10') as (feature, weight)
```

| | online | mini-batch |
|:--|:--:|:--:|
|`train_adagrad_regr` (logistic loss) | 0.3254586866367811 | -- |
|`train_regression` (squared loss) | 0.3356422627079689 | 0.3348889704327727 |

As I mentioned in my last comment, I was not sure whether the `-mini_batch` option works correctly with AdaGrad. Fortunately, this example shows that the option slightly improves prediction accuracy in terms of RMSE.
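As a side note, the evaluation logic of the queries above (sigmoid of the weighted feature sum, thresholded at 0.5 for accuracy, and compared directly against the label for RMSE) can be sketched in plain Python. This is a toy illustration with hypothetical weights and features, not the a9a model:

```python
import math

# Hypothetical per-feature weights standing in for the trained model table;
# "0" plays the role of the bias feature appended by add_bias().
model = {"0": -0.5, "f1": 1.2, "f2": -0.8}

def predict_prob(features):
    """Mirror of the predict CTE: sigmoid(sum(weight * value))."""
    margin = sum(model.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-margin))

# Toy test rows: (feature map, actual label).
test = [
    ({"0": 1.0, "f1": 1.0}, 1.0),
    ({"0": 1.0, "f2": 1.0}, 0.0),
    ({"0": 1.0, "f1": 1.0, "f2": 1.0}, 0.0),
]

# Classification accuracy: threshold the probability at 0.5,
# as in "case when sigmoid(...) >= 0.5 then 1.0 else 0.0 end".
correct = sum(1 for x, y in test
              if (1.0 if predict_prob(x) >= 0.5 else 0.0) == y)
accuracy = correct / len(test)

# Regression-style evaluation: RMSE between probability and actual label,
# as in "select rmse(probability, actual) from submit".
rmse = math.sqrt(sum((predict_prob(x) - y) ** 2 for x, y in test) / len(test))

print(accuracy, rmse)
```

This is only meant to make the two evaluation metrics concrete; the actual computation is done by the Hive queries above.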