kaknikhil opened a new pull request, #613:
URL: https://github.com/apache/madlib/pull/613

   JIRA: MADLIB-1517
   
   Note that this PR only fixes GLM, logisitic and linear. A future PR
   will fix other pmml modules.
   
   # Context :
   MADlib's way of passing intercept to regression models is a bit unusual.
   Usually intercept is a boolean which indicates whether the model needs to be
   fit with intercept or not. MADlib makes the user pass an integer (1 means use
   an intercept and no value means don't fit with intercept) along with the 
other
   independent variables and uses that directly for computation. For e.g.
   ARRAY[1,x1,x2,...] indicates use an intercept whereas ARRAY[x1,x2,...] means
   don't use an intercept
   
   # Problem:
   * So essentially all the regression algorithms treat the intercept value "1" 
as
     just another independent variable(it's always the first one though).
   * Because of this implicit assumption, users need to specifically inject a
     predictor named "1" with a value of 1 for all the input rows. This can be 
very
     inconvenient specially when using pmml to predict a stream of data or some
     other preprocessed form of data.
   
   # Fix:
   * Once the model is trained, the only way to know if the model was fit with
     intercept is to look at the `independent_varname` field in the summary 
table.
   * If the value contains ARRAY[1, x1, x2..], then an intercept was used.
   * Since this intercept is just another independent variable, there aren't any
     explicit references or logic to handle intercept in our python or c++ code 
for
     training or predict.
   * Because of this assumption, the pmml code also considers all of
     "ARRAY[1,x1,x2,...]" as independent variables and hence the output pmml
     contains "1" as an input predictor.
   * We can just remove all references to the column "1" in the pmml file. We 
will
     still keep the "p0" variable which is explicitly marked as an intercept and
     will store the intercept's coefficient
   * The pmml module gets the 'X' and 'Y' values from the summary table and then
     parses it to create a list of all the independent predictors so that it can
     be written to the pmml file
   * It uses regex to match the expression ARRAY[1,x1,x2]/ARRAY[x1,x2] and then
     returns either ['1','x1','x2'] or ['x1',x2']
   * Our goal with the pmml code is to not treat the intercept "1" as an
     independent predictor but just as an intercept
   * The commit fixes this by changing the regex and using the output to 
determine
     if an intercept was passed so that both expressions
     ARRAY[1,x1,x2]/ARRAY[x1,x2] return ['x1', 'x2']
   * Also had to make changes to the various pmml builder classes to treat
     intercept's coefficient differently than the feature coefficients
   * Note that this PR only fixes GLM, logisitic and linear. A future PR
     will fix other pmml modules.
   
   Before the fix:
   ```
   <?xml version="1.0" standalone="yes"?>
   <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1";>
    <Header copyright="Copyright (c) 2024 nkak">
      <Extension extender="MADlib" name="user" value="nkak"/>
      <Application name="MADlib" version="2.1.0"/>
      <Timestamp>2024-02-16 12:19:58.798139 PDT</Timestamp>
    </Header>
    <DataDictionary numberOfFields="4">
      <DataField name="second_attack_pmml_prediction" optype="categorical" 
dataType="boolean">
        <Value value="True"/>
        <Value value="False"/>
      </DataField>
      <DataField name="1" optype="continuous" dataType="double"/>
      <DataField name="treatment" optype="continuous" dataType="double"/>
      <DataField name="trait_anxiety" optype="continuous" dataType="double"/>
    </DataDictionary>
    <RegressionModel functionName="classification" 
normalizationMethod="softmax">
      <MiningSchema>
        <MiningField name="second_attack_pmml_prediction" 
usageType="predicted"/>
        <MiningField name="1"/>
        <MiningField name="treatment"/>
        <MiningField name="trait_anxiety"/>
      </MiningSchema>
      <RegressionTable intercept="0.0" targetCategory="True">
        <NumericPredictor name="1" coefficient="-6.363469941781864"/>
        <NumericPredictor name="treatment" coefficient="-1.0241060523932668"/>
        <NumericPredictor name="trait_anxiety" 
coefficient="0.11904491666860616"/>
      </RegressionTable>
      <RegressionTable intercept="0.0" targetCategory="False"/>
    </RegressionModel>
   </PMML>
   ```
   
   After the fix:
   ```
   <?xml version="1.0" standalone="yes"?>
   <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1";>
     <Header copyright="Copyright (c) 2024 nkak">
       <Extension extender="MADlib" name="user" value="nkak"/>
       <Application name="MADlib" version="2.1.0"/>
       <Timestamp>2024-02-16 13:37:15.367609 PDT</Timestamp>
     </Header>
     <DataDictionary numberOfFields="3">
       <DataField name="second_attack_pmml_prediction" optype="categorical" 
dataType="boolean">
         <Value value="True"/>
         <Value value="False"/>
       </DataField>
       <DataField name="treatment" optype="continuous" dataType="double"/>
       <DataField name="trait_anxiety" optype="continuous" dataType="double"/>
     </DataDictionary>
     <RegressionModel functionName="classification" 
normalizationMethod="softmax">
       <MiningSchema>
         <MiningField name="second_attack_pmml_prediction" 
usageType="predicted"/>
         <MiningField name="treatment"/>
         <MiningField name="trait_anxiety"/>
       </MiningSchema>
       <RegressionTable intercept="-6.36346994178186" targetCategory="True">
         <NumericPredictor name="treatment" coefficient="-1.0241060523932697"/>
         <NumericPredictor name="trait_anxiety" 
coefficient="0.11904491666860609"/>
       </RegressionTable>
       <RegressionTable intercept="0.0" targetCategory="False"/>
     </RegressionModel>
   </PMML>
   
   ```
   
   # Risks and limitations:
   1. Straying away from the intrinsic intercept assumption only for the pmml 
code:
     * As we have established already, the intercept is not treated any 
different
       from an independent variable.
     * To fix the pmml file to not include the intercept as an independent 
variable,
       we will need to break this intrinsic assumption.
     * If the pmml code breaks this assumption, it's possible that there might 
be
       some unexpected side effects or errors that even with exhaustive testing 
may
       not be uncovered. For e.g. the pmml code relies on the len of the 
coefficient
       to make some decisions about naming and such. (See formula.py for an 
example)
       which might have some weird edge cases
   
       We might be fine with this risk for now and if something breaks in the 
future,
       we can deal with it later. Biggest risk is that something fundamental 
breaks in
       the future that might make us revert this new logic. But the odds of 
that are
       pretty low
   
   2. If the user passed a non array expression for the independent variable
       
      Consider the following example
   
      ```
      -- Create a table where the x variable is an array of the independent 
variables to train on
      CREATE TABLE warpbreaks_dummy_simple_xcol AS SELECT breaks AS y, 
ARRAY[1,"wool_B","tension_M", "tension_H"] 
      AS x_a from warpbreaks_dummy;
   
      -- Now use the column 'x_a' created in the previous step.
      SELECT madlib.glm('warpbreaks_dummy_simple_xcol', 
'glm_warpbreaks_intercept_1_simple_xcol', 'y' , 'x_a' , 
      'family=poisson, link=log');
       ```
       Now there's no way for us know if this model was fit with an intercept 
or not.
       The only way to know is to check the value of "independent_varname" in 
the
       summary table which would be "x_A" in this case which won't tell us 
anything
       about the intercept.
   
       Ideally, we would like to change the fit functions to take a boolen for 
the
       intercept arg but that will too big of a change and hence is out of 
scope of this commit.
   
       The easiest fix for this problem for now is that we are going to assume 
that
       all non array expressions always include the intercept. Note that this
       assumption only applies to the pmml module
   
   3. Using the name_spec arg of the pmml function
   
     * The pmml function accepts an optional arg named "name_spec" which is 
used to explicitly name the input and output variables in the pmml file.
     * The user will now need to remove the "1" from this expression
        For e.g. `SELECT madlib.pmml('patients_logregr', 
'attack~1+anxiety+treatment');` will have to be rewritten as
         `SELECT madlib.pmml('patients_logregr', 'attack~anxiety+treatment');`
     * We will need to remove this from the pmml user docs which will be done 
in a separate PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@madlib.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to