Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/incubator-madlib/pull/69#discussion_r86891157
--- Diff: src/ports/postgres/modules/elastic_net/elastic_net.py_in ---
@@ -34,220 +41,345 @@ def elastic_net_help(schema_madlib,
family_or_optimizer=None, **kwargs):
if (family_or_optimizer is None or
family_or_optimizer.lower() in ("help", "?")):
return """
- ----------------------------------------------------------------
- Summary
- ----------------------------------------------------------------
- Right now, gaussian (linear) and binomial (logistic) families
- are supported!
- --
- Run:
- SELECT {schema_madlib}.elastic_net_train('gaussian');
- or
- SELECT {schema_madlib}.elastic_net_train('binomial');
- to see more help.
- --
- Run: SELECT {schema_madlib}.elastic_net_train('usage');
- to see how to use.
- --
- Run: SELECT {schema_madlib}.elastic_net_train('predict');
- to see how to predict.
+ ----------------------------------------------------------------
+ Summary
+ ----------------------------------------------------------------
+ Right now, gaussian (linear) and binomial (logistic) families
+ are supported!
+ --
+ Run:
+ SELECT {schema_madlib}.elastic_net_train('gaussian');
+ or
+ SELECT {schema_madlib}.elastic_net_train('binomial');
+ to see more help.
+ --
+ Run: SELECT {schema_madlib}.elastic_net_train('usage');
+ to see how to use.
+ --
+ Run: SELECT {schema_madlib}.elastic_net_train('predict');
+ to see how to predict.
+ --
+ Run: SELECT {schema_madlib}.elastic_net_train('example');
+ to see some examples.
""".format(schema_madlib=schema_madlib)
- if (family_or_optimizer.lower() == "usage"):
+ if (family_or_optimizer.lower() in ('example', 'examples')):
+ return """
+ ----------------------------------------------------------------
+ EXAMPLE
+ ----------------------------------------------------------------
+ Create an input data set:
+ DROP TABLE IF EXISTS houses;
+ CREATE TABLE houses ( id INT,
+ tax INT,
+ bedroom INT,
+ bath FLOAT,
+ price INT,
+ size INT,
+ lot INT,
+ grp_by_col INT);
+ INSERT INTO houses VALUES
+ (1, 590, 2, 1, 50000, 770, 22100, 1),
+ (2, 1050, 3, 2, 85000, 1410, 12000, 1),
+ (3, 20, 3, 1, 22500, 1060, 3500, 1),
+ (4, 870, 2, 2, 90000, 1300, 17500, 1),
+ (5, 1320, 3, 2, 133000, 1500, 30000, 1),
+ (6, 1350, 2, 1, 90500, 820, 25700, 1),
+ (7, 2790, 3, 2.5, 260000, 2130, 25000, 1),
+ (8, 680, 2, 1, 142500, 1170, 22000, 1),
+ (9, 1840, 3, 2, 160000, 1500, 19000, 1),
+ (10, 3680, 4, 2, 240000, 2790, 20000, 1),
+ (11, 1660, 3, 1, 87000, 1030, 17500, 1),
+ (12, 1620, 3, 2, 118600, 1250, 20000, 1),
+ (13, 3100, 3, 2, 140000, 1760, 38000, 1),
+ (14, 2070, 2, 3, 148000, 1550, 14000, 1),
+ (15, 650, 3, 1.5, 65000, 1450, 12000, 1),
+ (16, 770, 2, 2, 91000, 1300, 17500, 2),
+ (17, 1220, 3, 2, 132300, 1500, 30000, 2),
+ (18, 1150, 2, 1, 91100, 820, 25700, 2),
+ (19, 2690, 3, 2.5, 260011, 2130, 25000, 2),
+ (20, 780, 2, 1, 141800, 1170, 22000, 2),
+ (21, 1910, 3, 2, 160900, 1500, 19000, 2),
+ (22, 3600, 4, 2, 239000, 2790, 20000, 2),
+ (23, 1600, 3, 1, 81010, 1030, 17500, 2),
+ (24, 1590, 3, 2, 117910, 1250, 20000, 2),
+ (25, 3200, 3, 2, 141100, 1760, 38000, 2),
+ (26, 2270, 2, 3, 148011, 1550, 14000, 2),
+ (27, 750, 3, 1.5, 66000, 1450, 12000, 2);
+
+ Train a model:
+ DROP TABLE IF EXISTS houses_en, houses_en_summary;
+ SELECT {schema_madlib}.elastic_net_train(
+ 'houses', -- source table
+ 'houses_en', -- result table
+ 'price', -- dependent variable
+ 'array[tax, bath, size]', -- independent variable
+ 'gaussian', -- regression family
+ 0.5, -- alpha value
+ 0.1, -- lambda value
+ TRUE, -- standardize
+ NULL, -- grouping column(s)
+ 'fista', -- optimizer
+ '', -- optimizer parameters
+ NULL, -- excluded columns
+ 10000, -- maximum iterations
+ 1e-6 -- tolerance value
+ );
+
+ View the resulting model:
+ \\x on
+ SELECT * FROM houses_en;
+ \\x off
+
+ Use the prediction function to evaluate residuals:
+ SELECT id, price, predict, price - predict AS residual
+ FROM (
+ SELECT
+ houses.*,
+ {schema_madlib}.elastic_net_gaussian_predict(
+ m.coef_all,
+ m.intercept,
+ ARRAY[tax,bath,size]
+ ) AS predict
+ FROM houses, houses_en m) s
+ ORDER BY id;
+
+ Additional Example (with grouping):
+ DROP TABLE IF EXISTS houses_en1, houses_en1_summary;
+ SELECT {schema_madlib}.elastic_net_train( 'houses',
+ 'houses_en1',
+ 'price',
+ 'array[tax, bath, size]',
+ 'gaussian',
+ 1,
+ 30000,
+ TRUE,
+ 'grp_by_col',
+ 'fista',
+ '',
+ NULL,
+ 10000,
+ 1e-6
+ );
+
+ View the resulting model and see a separate model for each group:
+ \\x on
+ SELECT * FROM houses_en1;
+ \\x off
+
+ Use the prediction function to evaluate residuals:
+ SELECT {schema_madlib}.elastic_net_predict(
+ 'houses_en1', -- model table
+ 'houses', -- new source data table
+ 'id', -- unique ID associated with each row
+ 'houses_en1_prediction' -- table to store prediction result
+ );
+
+ View the results:
+ SELECT houses.id,
+ houses.price,
+ houses_en1_prediction.prediction,
+ houses.price - houses_en1_prediction.prediction AS residual
+ FROM houses_en1_prediction, houses
+ WHERE houses.id=houses_en1_prediction.id;
+
+ """
+
+ if (family_or_optimizer.lower() in ('usage', 'help', '?')):
return """
- ----------------------------------------------------------------
- USAGE
- ----------------------------------------------------------------
- SELECT {schema_madlib}.elastic_net_train (
- 'tbl_source', -- Data table
- 'tbl_result', -- Result table
- 'col_dep_var', -- Dependent variable, can be an expression
or
- '*'
- 'col_ind_var', -- Independent variable, can be an
expression
- 'regress_family', -- 'gaussian' (or 'linear'). 'binomial'
- (or 'logistic')
- alpha, -- Elastic net control parameter, value in
[0, 1]
- lambda_value, -- Regularization parameter, positive
- standardize, -- Whether to normalize the data
- 'grouping_col', -- Group by which columns. (DEFAULT: NULL)
- 'optimizer', -- Name of optimizer. (DEFAULT: 'fista')
- 'optimizer_params',-- Comma-separated string of optimizer
parameters
- 'excluded', -- Column names excluded from '*' (DEFAULT
= NULL)
- max_iter, -- Maximum iteration number (DEFAULT = 1000)
- tolerance -- Stopping criteria (DEFAULT = 1e-4)
- );
- ----------------------------------------------------------------
- OUTPUT
- ----------------------------------------------------------------
- The output table ('tbl_result' above) has the following columns:
- family TEXT, -- 'gaussian' or 'binomial'
- features TEXT[], -- All feature column names
- features_selected TEXT[], -- Features with non-zero
coefficients
- coef_nonzero DOUBLE PRECISION[], -- Non-zero coefficients
- coef_all DOUBLE PRECISION[], -- All coefficients
- intercept DOUBLE PRECISION, -- Intercept of the linear
fit
- log_likelihood DOUBLE PRECISION, -- log-likelihood of the fit
- standardize BOOLEAN, -- Whether the data was standardized
- before fitting
- iteration_run INTEGER -- How many iteration was actually
run
-
- If the independent variable is a column with type of array,
features
- and features_selected will output indices of the array.
+ ----------------------------------------------------------------
+ USAGE
+ ----------------------------------------------------------------
+ SELECT {schema_madlib}.elastic_net_train (
+ 'tbl_source', -- Data table
+ 'tbl_result', -- Result table
+ 'col_dep_var', -- Dependent variable, can be an expression or
+ '*'
+ 'col_ind_var', -- Independent variable, can be an expression
+ 'regress_family', -- 'gaussian' (or 'linear'). 'binomial'
+ (or 'logistic')
+ alpha, -- Elastic net control parameter, value in [0, 1]
+ lambda_value, -- Regularization parameter, positive
+ standardize, -- Whether to normalize the data
+ 'grouping_col', -- Group by which columns. (DEFAULT: NULL)
+ 'optimizer', -- Name of optimizer. (DEFAULT: 'fista')
+ 'optimizer_params',-- Comma-separated string of optimizer
parameters
+ 'excluded', -- Column names excluded from '*' (DEFAULT =
NULL)
+ max_iter, -- Maximum iteration number (DEFAULT = 1000)
+ tolerance -- Stopping criteria (DEFAULT = 1e-4)
+ );
+ ----------------------------------------------------------------
+ OUTPUT
+ ----------------------------------------------------------------
+ The output table ('tbl_result' above) has the following columns:
+ grouping_col TEXT, --'Distinct values of
grouping_col'
+ family TEXT, --'gaussian' or 'binomial'
+ features TEXT[], -- All feature column names
+ features_selected TEXT[], -- Features with non-zero
coefficients
+ coef_nonzero DOUBLE PRECISION[], -- Non-zero coefficients
+ coef_all DOUBLE PRECISION[], -- All coefficients
+ intercept DOUBLE PRECISION, -- Intercept of the linear fit
+ log_likelihood DOUBLE PRECISION, -- log-likelihood of the fit
+ standardize BOOLEAN, -- Whether the data was
standardized
+ -- before fitting
+ iteration_run INTEGER -- How many iterations were
actually run
+
+ If the independent variable is a column with type of array, features
+ and features_selected will output indices of the array.
""".format(schema_madlib=schema_madlib)
if family_or_optimizer.lower() == "predict":
return """
- ----------------------------------------------------------------
- Prediction
- ----------------------------------------------------------------
- SELECT {schema_madlib}.elastic_net_predict(
- 'regress_family', -- 'gaussian' (or 'linear'). 'binomial'
- (or 'logistic') will be supported
- coefficients, -- Fitting coefficients as a double
- array
- intercept,
- ind_var -- independent variables
+ ----------------------------------------------------------------
+ Prediction
+ ----------------------------------------------------------------
+ SELECT {schema_madlib}.elastic_net_predict(
+ 'regress_family', -- 'gaussian' (or 'linear'). 'binomial'
+ -- (or 'logistic') will be supported
+ coefficients, -- Fitting coefficients as a double array
+ intercept,
+ ind_var -- independent variables
+ ) FROM tbl_result, tbl_new_source;
+
+ When predicting with binomial models, the return value is 1
+ if the predicted result is True, and 0 if the prediction is
+ False.
+
+ OR -------------------------------------------------------------
+
+ (1) SELECT {schema_madlib}.elastic_net_gaussian_predict (
+ coefficients, intercept, ind_var
) FROM tbl_result, tbl_new_source;
- When predicting with binomial models, the return value is 1
- if the predicted result is True, and 0 if the prediction is
- False.
-
- OR -------------------------------------------------------------
-
- (1) SELECT {schema_madlib}.elastic_net_gaussian_predict (
- coefficients, intercept, ind_var
- ) FROM tbl_result, tbl_new_source;
-
- (2) SELECT {schema_madlib}.elastic_net_binomial_predict (
- coefficients, intercept, ind_var
- ) FROM tbl_result, tbl_new_source;
+ (2) SELECT {schema_madlib}.elastic_net_binomial_predict (
+ coefficients, intercept, ind_var
+ ) FROM tbl_result, tbl_new_source;
- (3) SELECT {schema_madlib}.elastic_net_binomial_prob (
- coefficients, intercept, ind_var
- ) FROM tbl_result, tbl_new_source;
+ (3) SELECT {schema_madlib}.elastic_net_binomial_prob (
+ coefficients, intercept, ind_var
+ ) FROM tbl_result, tbl_new_source;
- This returns probability values for the class being 'True'.
+ This returns probability values for the class being 'True'.
- OR -------------------------------------------------------------
+ OR -------------------------------------------------------------
- SELECT {schema_madlib}.elastic_net_predict(
- 'tbl_model', -- Result table of elastic_net_train
- 'tbl_new_source', -- New data source
- 'col_id', -- Unique ID column
- 'tbl_predict' -- Prediction result
- );
- will put all prediction results into a table. This can be
- used together with cross_validation_general() function.
+ SELECT {schema_madlib}.elastic_net_predict(
+ 'tbl_model', -- Result table of elastic_net_train
+ 'tbl_new_source', -- New data source
+ 'col_id', -- Unique ID column
+ 'tbl_predict' -- Prediction result
+ );
+ will put all prediction results into a table. This can be
+ used together with cross_validation_general() function.
- When predicting with binomial models, the predicted values
- are BOOLEAN.
+ When predicting with binomial models, the predicted values
+ are BOOLEAN.
""".format(schema_madlib=schema_madlib)
if (family_or_optimizer.lower() in ("gaussian", "linear")):
return """
- ----------------------------------------------------------------
- Fitting linear models
- ----------------------------------------------------------------
- Supported optimizer:
- (1) Incremental gradient descent method ('igd')
- (2) Fast iterative shrinkage thesholding algorithm ('fista')
-
- Default is 'fista'
- --
- Run:
- SELECT {schema_madlib}.elastic_net_train('optimizer');
- to see more help on each optimizer.
+ ----------------------------------------------------------------
+ Fitting linear models
+ ----------------------------------------------------------------
+ Supported optimizer:
+ (1) Incremental gradient descent method ('igd')
+ (2) Fast iterative shrinkage thresholding algorithm ('fista')
+
+ Default is 'fista'
+ --
+ Run:
+ SELECT {schema_madlib}.elastic_net_train('optimizer');
+ to see more help on each optimizer.
""".format(schema_madlib=schema_madlib)
if (family_or_optimizer.lower() in ("binomial", "logistic")):
return """
- ----------------------------------------------------------------
- Fitting logistic models
- ----------------------------------------------------------------
- The dependent variable must be a BOOLEAN.
-
- Supported optimizer:
- (1) Incremental gradient descent method ('igd')
- (2) Fast iterative shrinkage thesholding algorithm ('fista')
-
- Default is 'fista'
- --
- Run:
- SELECT {schema_madlib}.elastic_net_train('optimizer');
- to see more help on each optimizer.
+ ----------------------------------------------------------------
+ Fitting logistic models
+ ----------------------------------------------------------------
+ The dependent variable must be a BOOLEAN.
+
+ Supported optimizer:
+ (1) Incremental gradient descent method ('igd')
+ (2) Fast iterative shrinkage thresholding algorithm ('fista')
+
+ Default is 'fista'
+ --
+ Run:
+ SELECT {schema_madlib}.elastic_net_train('optimizer');
+ to see more help on each optimizer.
""".format(schema_madlib=schema_madlib)
if family_or_optimizer.lower() == "igd":
return """
- ----------------------------------------------------------------
- Incremental gradient descent (IGD) method
- ----------------------------------------------------------------
- Right now, it supports fitting both linear and logistic models.
-
- In order to obtain sparse coefficients, a
- modified version of IGD is actually used.
-
- Parameters --------------------------------
- stepsize - default is 0.01
- threshold - default is 1e-10. When a coefficient is really
- small, set it to be 0
- warmup - default is False
- warmup_lambdas - default is Null
- warmup_lambda_no - default is 15. How many lambda's are used in
- warm-up, will be overridden if warmup_lambdas
- is not NULL
- warmup_tolerance - default is the same as tolerance. The value
- of tolerance used during warmup.
- parallel - default is True. Run the computation on
- multiple segments or not.
-
- When warmup is True and warmup_lambdas is NULL, a series
- of lambda values will be automatically generated and used.
-
- Reference --------------------------------
- [1] Shai Shalev-Shwartz and Ambuj Tewari, Stochastic Methods for l1
- Regularized Loss Minimization. Proceedings of the 26th Interna-
- tional Conference on Machine Learning, Montreal, Canada, 2009.
+ ----------------------------------------------------------------
+ Incremental gradient descent (IGD) method
+ ----------------------------------------------------------------
+ Right now, it supports fitting both linear and logistic models.
+
+ In order to obtain sparse coefficients, a
+ modified version of IGD is actually used.
+
+ Parameters --------------------------------
+ stepsize - default is 0.01
+ threshold - default is 1e-10. When a coefficient is really
+ small, set it to be 0
+ warmup - default is False
+ warmup_lambdas - default is Null
+ warmup_lambda_no - default is 15. How many lambda's are used in
+ warm-up, will be overridden if warmup_lambdas
+ is not NULL
+ warmup_tolerance - default is the same as tolerance. The value
+ of tolerance used during warmup.
+ parallel - default is True. Run the computation on
+ multiple segments or not.
+
+ When warmup is True and warmup_lambdas is NULL, a series
+ of lambda values will be automatically generated and used.
+
+ Reference --------------------------------
+ [1] Shai Shalev-Shwartz and Ambuj Tewari, Stochastic Methods for l1
+ Regularized Loss Minimization. Proceedings of the 26th Interna-
+ tional Conference on Machine Learning, Montreal, Canada, 2009.
"""
if family_or_optimizer.lower() == "fista":
return """
- ----------------------------------------------------------------
- Fast iterative shrinkage thesholding algorithm
- with backtracking for stepsizes
- ----------------------------------------------------------------
- Right now, it supports fitting both linear and logistic models.
-
- Parameters --------------------------------
- max_stepsize - default is 4.0
- eta - default is 1.2, if stepsize does not work
- stepsize/eta will be tried
- warmup - default is False
- warmup_lambdas - default is NULL, which means that lambda
- values will be automatically generated
- warmup_lambda_no - default is 15. How many lambda's are used in
- warm-up, will be overridden if warmup_lambdas
- is not NULL
- warmup_tolerance - default is the same as tolerance. The value
- of tolerance used during warmup.
- use_active_set - default is False. Sometimes active-set method
- can speed up the calculation.
- activeset_tolerance - default is the same as tolerance. The
- value of tolerance used during active set
- calculation
- random_stepsize - default is False. Whether add some randomness
- to the step size. Sometimes, this can speed
- up the calculation.
-
- When warmup is True and warmup_lambdas is NULL, warmup_lambda_no
- of lambda values will be automatically generated and used.
-
- Reference --------------------------------
- [1] Beck, A. and M. Teboulle (2009), A fast iterative
- shrinkage-thresholding algorothm for linear inverse
- problems. SIAM J. on Imaging Sciences 2(1), 183-202.
+ ----------------------------------------------------------------
+ Fast iterative shrinkage thresholding algorithm
+ with backtracking for stepsizes
+ ----------------------------------------------------------------
+ Right now, it supports fitting both linear and logistic models.
+
+ Parameters --------------------------------
+ max_stepsize - default is 4.0
+ eta - default is 1.2, if stepsize does not work
+ stepsize/eta will be tried
+ warmup - default is False
+ warmup_lambdas - default is NULL, which means that lambda
+ values will be automatically generated
+ warmup_lambda_no - default is 15. How many lambda's are used in
+ warm-up, will be overridden if warmup_lambdas
+ is not NULL
+ warmup_tolerance - default is the same as tolerance. The value
+ of tolerance used during warmup.
+ use_active_set - default is False. Sometimes active-set method
+ can speed up the calculation.
+ activeset_tolerance - default is the same as tolerance. The
+ value of tolerance used during active set
+ calculation
+ random_stepsize - default is False. Whether add some randomness
+ to the step size. Sometimes, this can speed
+ up the calculation.
+
+ When warmup is True and warmup_lambdas is NULL, warmup_lambda_no
+ of lambda values will be automatically generated and used.
+
+ Reference --------------------------------
+ [1] Beck, A. and M. Teboulle (2009), A fast iterative
+ shrinkage-thresholding algorithm for linear inverse
+ problems. SIAM J. on Imaging Sciences 2(1), 183-202.
--- End diff --
double comma?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---