kaknikhil opened a new pull request, #613: URL: https://github.com/apache/madlib/pull/613
JIRA: MADLIB-1517 Note that this PR only fixes GLM, logisitic and linear. A future PR will fix other pmml modules. # Context : MADlib's way of passing intercept to regression models is a bit unusual. Usually intercept is a boolean which indicates whether the model needs to be fit with intercept or not. MADlib makes the user pass an integer (1 means use an intercept and no value means don't fit with intercept) along with the other independent variables and uses that directly for computation. For e.g. ARRAY[1,x1,x2,...] indicates use an intercept whereas ARRAY[x1,x2,...] means don't use an intercept # Problem: * So essentially all the regression algorithms treat the intercept value "1" as just another independent variable(it's always the first one though). * Because of this implicit assumption, users need to specifically inject a predictor named "1" with a value of 1 for all the input rows. This can be very inconvenient specially when using pmml to predict a stream of data or some other preprocessed form of data. # Fix: * Once the model is trained, the only way to know if the model was fit with intercept is to look at the `independent_varname` field in the summary table. * If the value contains ARRAY[1, x1, x2..], then an intercept was used. * Since this intercept is just another independent variable, there aren't any explicit references or logic to handle intercept in our python or c++ code for training or predict. * Because of this assumption, the pmml code also considers all of "ARRAY[1,x1,x2,...]" as independent variables and hence the output pmml contains "1" as an input predictor. * We can just remove all references to the column "1" in the pmml file. We will still keep the "p0" variable which is explicitly marked as an intercept and will store the intercept's coefficient * The pmml module gets the 'X' and 'Y' values from the summary table and then parses it to create a list of all the independent predictors so that it can be written to the pmml file * It uses regex to match the expression ARRAY[1,x1,x2]/ARRAY[x1,x2] and then returns either ['1','x1','x2'] or ['x1',x2'] * Our goal with the pmml code is to not treat the intercept "1" as an independent predictor but just as an intercept * The commit fixes this by changing the regex and using the output to determine if an intercept was passed so that both expressions ARRAY[1,x1,x2]/ARRAY[x1,x2] return ['x1', 'x2'] * Also had to make changes to the various pmml builder classes to treat intercept's coefficient differently than the feature coefficients * Note that this PR only fixes GLM, logisitic and linear. A future PR will fix other pmml modules. Before the fix: ``` <?xml version="1.0" standalone="yes"?> <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1"> <Header copyright="Copyright (c) 2024 nkak"> <Extension extender="MADlib" name="user" value="nkak"/> <Application name="MADlib" version="2.1.0"/> <Timestamp>2024-02-16 12:19:58.798139 PDT</Timestamp> </Header> <DataDictionary numberOfFields="4"> <DataField name="second_attack_pmml_prediction" optype="categorical" dataType="boolean"> <Value value="True"/> <Value value="False"/> </DataField> <DataField name="1" optype="continuous" dataType="double"/> <DataField name="treatment" optype="continuous" dataType="double"/> <DataField name="trait_anxiety" optype="continuous" dataType="double"/> </DataDictionary> <RegressionModel functionName="classification" normalizationMethod="softmax"> <MiningSchema> <MiningField name="second_attack_pmml_prediction" usageType="predicted"/> <MiningField name="1"/> <MiningField name="treatment"/> <MiningField name="trait_anxiety"/> </MiningSchema> <RegressionTable intercept="0.0" targetCategory="True"> <NumericPredictor name="1" coefficient="-6.363469941781864"/> <NumericPredictor name="treatment" coefficient="-1.0241060523932668"/> <NumericPredictor name="trait_anxiety" coefficient="0.11904491666860616"/> </RegressionTable> <RegressionTable intercept="0.0" targetCategory="False"/> </RegressionModel> </PMML> ``` After the fix: ``` <?xml version="1.0" standalone="yes"?> <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1"> <Header copyright="Copyright (c) 2024 nkak"> <Extension extender="MADlib" name="user" value="nkak"/> <Application name="MADlib" version="2.1.0"/> <Timestamp>2024-02-16 13:37:15.367609 PDT</Timestamp> </Header> <DataDictionary numberOfFields="3"> <DataField name="second_attack_pmml_prediction" optype="categorical" dataType="boolean"> <Value value="True"/> <Value value="False"/> </DataField> <DataField name="treatment" optype="continuous" dataType="double"/> <DataField name="trait_anxiety" optype="continuous" dataType="double"/> </DataDictionary> <RegressionModel functionName="classification" normalizationMethod="softmax"> <MiningSchema> <MiningField name="second_attack_pmml_prediction" usageType="predicted"/> <MiningField name="treatment"/> <MiningField name="trait_anxiety"/> </MiningSchema> <RegressionTable intercept="-6.36346994178186" targetCategory="True"> <NumericPredictor name="treatment" coefficient="-1.0241060523932697"/> <NumericPredictor name="trait_anxiety" coefficient="0.11904491666860609"/> </RegressionTable> <RegressionTable intercept="0.0" targetCategory="False"/> </RegressionModel> </PMML> ``` # Risks and limitations: 1. Straying away from the intrinsic intercept assumption only for the pmml code: * As we have established already, the intercept is not treated any different from an independent variable. * To fix the pmml file to not include the intercept as an independent variable, we will need to break this intrinsic assumption. * If the pmml code breaks this assumption, it's possible that there might be some unexpected side effects or errors that even with exhaustive testing may not be uncovered. For e.g. the pmml code relies on the len of the coefficient to make some decisions about naming and such. (See formula.py for an example) which might have some weird edge cases We might be fine with this risk for now and if something breaks in the future, we can deal with it later. Biggest risk is that something fundamental breaks in the future that might make us revert this new logic. But the odds of that are pretty low 2. If the user passed a non array expression for the independent variable Consider the following example ``` -- Create a table where the x variable is an array of the independent variables to train on CREATE TABLE warpbreaks_dummy_simple_xcol AS SELECT breaks AS y, ARRAY[1,"wool_B","tension_M", "tension_H"] AS x_a from warpbreaks_dummy; -- Now use the column 'x_a' created in the previous step. SELECT madlib.glm('warpbreaks_dummy_simple_xcol', 'glm_warpbreaks_intercept_1_simple_xcol', 'y' , 'x_a' , 'family=poisson, link=log'); ``` Now there's no way for us know if this model was fit with an intercept or not. The only way to know is to check the value of "independent_varname" in the summary table which would be "x_A" in this case which won't tell us anything about the intercept. Ideally, we would like to change the fit functions to take a boolen for the intercept arg but that will too big of a change and hence is out of scope of this commit. The easiest fix for this problem for now is that we are going to assume that all non array expressions always include the intercept. Note that this assumption only applies to the pmml module 3. Using the name_spec arg of the pmml function * The pmml function accepts an optional arg named "name_spec" which is used to explicitly name the input and output variables in the pmml file. * The user will now need to remove the "1" from this expression For e.g. `SELECT madlib.pmml('patients_logregr', 'attack~1+anxiety+treatment');` will have to be rewritten as `SELECT madlib.pmml('patients_logregr', 'attack~anxiety+treatment');` * We will need to remove this from the pmml user docs which will be done in a separate PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@madlib.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org