GitHub user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3062#issuecomment-95830975
  
    @selvinsource I tested `KMeans.toPMML`, which looks good. Linear regressors 
look good too, but the linear classifiers `LogisticRegressionModel` and 
`SVMModel` need a few changes:
    
    1. We should use `functionName="classification"` (as documented in 
http://www.dmg.org/v4-2-1/Regression.html). The following is an example 
logistic regression model in PMML (copied from the URL above):
       ~~~xml
    <PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
      <Header copyright="DMG.org"/>
      <DataDictionary numberOfFields="5">
        <DataField name="age" optype="continuous" dataType="double"/>
        <DataField name="work" optype="continuous" dataType="double"/>
        <DataField name="sex" optype="categorical" dataType="string">
          <Value value="0"/>
          <Value value="1"/>
        </DataField>
        <DataField name="minority" optype="categorical" dataType="integer">
          <Value value="0"/>
          <Value value="1"/>
        </DataField>
        <DataField name="jobcat" optype="categorical" dataType="string">
          <Value value="clerical"/>
          <Value value="professional"/>
          <Value value="trainee"/>
          <Value value="skilled"/>
        </DataField>
      </DataDictionary>
      <RegressionModel modelName="Sample for logistic regression" 
functionName="classification" algorithmName="logisticRegression" 
normalizationMethod="softmax" targetFieldName="jobcat">
        <MiningSchema>
          <MiningField name="age"/>
          <MiningField name="work"/>
          <MiningField name="sex"/>
          <MiningField name="minority"/>
          <MiningField name="jobcat" usageType="target"/>
        </MiningSchema>
        <RegressionTable intercept="46.418" targetCategory="clerical">
          <NumericPredictor name="age" exponent="1" coefficient="-0.132"/>
          <NumericPredictor name="work" exponent="1" coefficient="7.867E-02"/>
          <CategoricalPredictor name="sex" value="0" coefficient="-20.525"/>
          <CategoricalPredictor name="sex" value="1" coefficient="0.5"/>
          <CategoricalPredictor name="minority" value="0" 
coefficient="-19.054"/>
          <CategoricalPredictor name="minority" value="1" coefficient="0"/>
        </RegressionTable>
        <RegressionTable intercept="51.169" targetCategory="professional">
          <NumericPredictor name="age" exponent="1" coefficient="-0.302"/>
          <NumericPredictor name="work" exponent="1" coefficient="0.155"/>
          <CategoricalPredictor name="sex" value="0" coefficient="-21.389"/>
          <CategoricalPredictor name="sex" value="1" coefficient="0.1"/>
          <CategoricalPredictor name="minority" value="0" 
coefficient="-18.443"/>
          <CategoricalPredictor name="minority" value="1" coefficient="0"/>
        </RegressionTable>
        <RegressionTable intercept="25.478" targetCategory="trainee">
          <NumericPredictor name="age" exponent="1" coefficient="-0.154"/>
          <NumericPredictor name="work" exponent="1" coefficient="0.266"/>
          <CategoricalPredictor name="sex" value="0" coefficient="-2.639"/>
          <CategoricalPredictor name="sex" value="1" coefficient="0.8"/>
          <CategoricalPredictor name="minority" value="0" 
coefficient="-19.821"/>
          <CategoricalPredictor name="minority" value="1" coefficient="0.2"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="skilled"/>
      </RegressionModel>
    </PMML>
    ~~~
    
    2. `threshold` is not encoded in the PMML, but it should be.
    
    3. We don't support multinomial logistic regression models yet, so we 
should check the number of classes and throw an exception if the model is 
multiclass.
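
    Points 1 and 3 above can be sketched roughly as follows. This is a minimal, 
self-contained Python illustration and not the code in this PR: the helper 
`binary_classifier_to_pmml`, the `num_classes` parameter, the `field_i` names, 
and the `target` field are all hypothetical. It builds only the 
`RegressionModel` element and leaves out the threshold from point 2, because 
how to encode it is not specified here.

```python
import xml.etree.ElementTree as ET


def binary_classifier_to_pmml(weights, intercept, num_classes=2):
    """Hypothetical helper: build a PMML RegressionModel element
    for a binary linear classifier."""
    # Point 3: refuse multiclass models rather than export a wrong document.
    if num_classes != 2:
        raise ValueError(
            f"PMML export supports binary classifiers only, got {num_classes} classes"
        )

    # Point 1: linear classifiers are exported as classification models.
    model = ET.Element("RegressionModel", {
        "modelName": "binary linear classifier",
        "functionName": "classification",
    })

    # Declare the (hypothetical) input fields and the target field.
    schema = ET.SubElement(model, "MiningSchema")
    for i in range(len(weights)):
        ET.SubElement(schema, "MiningField", {"name": f"field_{i}"})
    ET.SubElement(schema, "MiningField", {"name": "target", "usageType": "target"})

    # One table carries the learned weights for class "1".
    table = ET.SubElement(model, "RegressionTable", {
        "intercept": repr(float(intercept)),
        "targetCategory": "1",
    })
    for i, w in enumerate(weights):
        ET.SubElement(table, "NumericPredictor", {
            "name": f"field_{i}",
            "exponent": "1",
            "coefficient": repr(float(w)),
        })

    # An empty reference table for class "0", like the "skilled" table
    # in the DMG example.
    ET.SubElement(model, "RegressionTable", {
        "intercept": "0.0",
        "targetCategory": "0",
    })
    return model


if __name__ == "__main__":
    element = binary_classifier_to_pmml([0.5, -1.25], 0.1)
    print(ET.tostring(element, encoding="unicode"))
```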
    
    I think we should fix 1 and 2 and throw an exception for 3 in this PR. 
Multiclass support can come later. Since the feature freeze deadline is close, 
do you have time to update this PR before next Wed (leaving some buffer for 
review)? Thanks!

