Frank McQuillan created MADLIB-1250:
---------------------------------------
Summary: Can't generate cross validation table for SVM
Key: MADLIB-1250
URL: https://issues.apache.org/jira/browse/MADLIB-1250
Project: Apache MADlib
Issue Type: Bug
Components: Module: Support Vector Machines
Reporter: Frank McQuillan
Fix For: v1.15
SVM does provide cross validation (CV):
1) The CV results table can be obtained by setting validation_result in the
params argument. The table name is arbitrary and could be, for example,
<output_table>_cv.
2) The _summary table reports the best cross-validated parameters, which
correspond to the model in the output table. This gives the user the exact
parameters needed to recreate the model. Whether that should be the purpose
of the summary table is open for debate.
3) The docs are definitely missing examples for CV.
But there seems to be a bug. To reproduce it, create a test table:
{code}
DROP TABLE IF EXISTS houses;
CREATE TABLE houses (id INT, tax INT, bedroom INT, bath FLOAT, price INT,
size INT, lot INT);
INSERT INTO houses VALUES
(1 , 590 , 2 , 1 , 50000 , 770 , 22100),
(2 , 1050 , 3 , 2 , 85000 , 1410 , 12000),
(3 , 20 , 3 , 1 , 22500 , 1060 , 3500),
(4 , 870 , 2 , 2 , 90000 , 1300 , 17500),
(5 , 1320 , 3 , 2 , 133000 , 1500 , 30000),
(6 , 1350 , 2 , 1 , 90500 , 820 , 25700),
(7 , 2790 , 3 , 2.5 , 260000 , 2130 , 25000),
(8 , 680 , 2 , 1 , 142500 , 1170 , 22000),
(9 , 1840 , 3 , 2 , 160000 , 1500 , 19000),
(10 , 3680 , 4 , 2 , 240000 , 2790 , 20000),
(11 , 1660 , 3 , 1 , 87000 , 1030 , 17500),
(12 , 1620 , 3 , 2 , 118600 , 1250 , 20000),
(13 , 3100 , 3 , 2 , 140000 , 1760 , 38000),
(14 , 2070 , 2 , 3 , 148000 , 1550 , 14000),
(15 , 650 , 3 , 1.5 , 65000 , 1450 , 12000);
{code}
Run training with CV:
{code}
DROP TABLE IF EXISTS houses_svm_gaussian_regression,
houses_svm_gaussian_regression_summary, houses_svm_gaussian_regression_random,
houses_svm_gaussian_regression_cv;
SELECT madlib.svm_regression( 'houses',
'houses_svm_gaussian_regression',
'price',
'ARRAY[1, tax, bath, size]',
'gaussian',
'n_components=10',
'',
'init_stepsize=[0.01, 1], max_iter=200,
validation_result=houses_svm_gaussian_regression_cv, n_folds=3'
);
SELECT * FROM houses_svm_gaussian_regression_cv;
{code}
The call fails with the following error:
{code}
InternalError: (psycopg2.InternalError) KeyError: 'params_dict'
(plpython.c:4960)
CONTEXT: Traceback (most recent call last):
PL/Python function "svm_regression", line 23, in <module>
return svm.svm(**globals())
PL/Python function "svm_regression", line 970, in svm
PL/Python function "svm_regression", line 1033, in _cross_validate_svm
PL/Python function "svm_regression", line 146, in output_tbl
PL/Python function "svm_regression"
[SQL: "SELECT madlib.svm_regression( 'houses',\n
'houses_svm_gaussian_regression',\n 'price',\n
'ARRAY[1, tax, bath, size]',\n
'gaussian',\n 'n_components=10',\n
'',\n 'init_stepsize=[0.01, 1],
max_iter=200, validation_result=houses_svm_gaussian_regression_cv, n_folds=3'\n
);"]
{code}
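Once training completes without the error above, point 2 could be checked with a query along these lines. This is only a sketch: the table name is the one dropped in the DROP TABLE statement of the training step, and no column names are assumed because the layout of the summary table is not shown in this report.
{code}
-- Summary table written alongside the model table; per point 2 it should
-- report the best cross-validated parameters for the model in the output table.
SELECT * FROM houses_svm_gaussian_regression_summary;
{code}
Comparing this output with the rows in houses_svm_gaussian_regression_cv would show whether the reported parameters match the best-scoring CV entry.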