Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/79

I tested the generic classifier and regressor on EMR using the a9a dataset.

### Classifier

```
set hivevar:n_samples=16281;
set hivevar:total_steps=32562;
```

#### `logress`

```sql
drop table if exists logress_model;
create table logress_model as
select
  feature,
  avg(weight) as weight
from (
  select
    logress(add_bias(features), label, '-total_steps ${total_steps}') as (feature, weight)
    -- logress(add_bias(features), label, '-total_steps ${total_steps} -mini_batch 10') as (feature, weight)
  from
    train_x3
) t
group by feature;
```

```sql
WITH test_exploded as (
  select
    rowid,
    label,
    extract_feature(feature) as feature,
    extract_weight(feature) as value
  from
    test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
  select
    t.rowid,
    sigmoid(sum(m.weight * t.value)) as prob,
    CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label
  from
    test_exploded t
    LEFT OUTER JOIN logress_model m ON (t.feature = m.feature)
  group by
    t.rowid
),
submit as (
  select
    t.label as actual,
    pd.label as predicted,
    pd.prob as probability
  from
    test t
    JOIN predict pd on (t.rowid = pd.rowid)
)
select count(1) / ${n_samples} from submit where actual = predicted;
```

#### `train_classifier`

```sql
train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}') as (feature, weight)
-- train_classifier(add_bias(features), label, '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps} -mini_batch 10') as (feature, weight)
```

The results were exactly the same:

| | online | mini-batch |
|:--|:--:|:--:|
|`logress`| 0.8414716540753026 | 0.848965051286776 |
|`train_classifier`| 0.8414716540753026 | 0.848965051286776 |

### Regression

I solved the a9a label prediction as a regression problem. Note that, since the non-generic AdaGrad implementation was designed for logistic loss (i.e., classification), we cannot compare it with the generic regressor under exactly the same conditions.

#### `train_adagrad_regr` (internally uses logistic loss)

```sql
drop table if exists adagrad_model;
create table adagrad_model as
select
  feature,
  avg(weight) as weight
from (
  select
    train_adagrad_regr(features, label) as (feature, weight)
  from
    train_x3
) t
group by feature;
```

```sql
WITH test_exploded as (
  select
    rowid,
    label,
    extract_feature(feature) as feature,
    extract_weight(feature) as value
  from
    test LATERAL VIEW explode(add_bias(features)) t AS feature
),
predict as (
  select
    t.rowid,
    sigmoid(sum(m.weight * t.value)) as prob
  from
    test_exploded t
    LEFT OUTER JOIN adagrad_model m ON (t.feature = m.feature)
  group by
    t.rowid
),
submit as (
  select
    t.label as actual,
    pd.prob as probability
  from
    test t
    JOIN predict pd on (t.rowid = pd.rowid)
)
select rmse(probability, actual) from submit;
```

#### `train_regression`

```sql
train_regression(features, label, '-loss squaredloss -opt AdaGrad -reg no') as (feature, weight)
-- train_regression(features, label, '-loss squaredloss -opt AdaGrad -reg no -mini_batch 10') as (feature, weight)
```

| | online | mini-batch |
|:--|:--:|:--:|
|`train_adagrad_regr` (logistic loss) | 0.3254586866367811 | -- |
|`train_regression` (squared loss) | 0.3356422627079689 | 0.3348889704327727 |

As I mentioned in my last comment, I was not sure whether the `-mini_batch` option works correctly with AdaGrad. Fortunately, this example shows that the option slightly improves prediction accuracy in terms of RMSE.
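As a side note, the evaluation logic of the queries above (sigmoid of the weighted feature sum, thresholded at 0.5 for accuracy, and compared directly against the label for RMSE) can be sketched in plain Python. This is a toy illustration with hypothetical weights and features, not the a9a model:

```python
import math

# Hypothetical per-feature weights standing in for the trained model table;
# "0" plays the role of the bias feature appended by add_bias().
model = {"0": -0.5, "f1": 1.2, "f2": -0.8}

def predict_prob(features):
    """Mirror of the predict CTE: sigmoid(sum(weight * value))."""
    margin = sum(model.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-margin))

# Toy test rows: (feature map, actual label).
test = [
    ({"0": 1.0, "f1": 1.0}, 1.0),
    ({"0": 1.0, "f2": 1.0}, 0.0),
    ({"0": 1.0, "f1": 1.0, "f2": 1.0}, 0.0),
]

# Classification accuracy: threshold the probability at 0.5,
# as in "case when sigmoid(...) >= 0.5 then 1.0 else 0.0 end".
correct = sum(1 for x, y in test
              if (1.0 if predict_prob(x) >= 0.5 else 0.0) == y)
accuracy = correct / len(test)

# Regression-style evaluation: RMSE between probability and actual label,
# as in "select rmse(probability, actual) from submit".
rmse = math.sqrt(sum((predict_prob(x) - y) ** 2 for x, y in test) / len(test))

print(accuracy, rmse)
```

This is only meant to make the two evaluation metrics concrete; the actual computation is done by the Hive queries above.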