GitHub user vruusmann commented on the pull request:
https://github.com/apache/spark/pull/9207#issuecomment-217136218
A thought on designing an interface for exporting ML solutions (illustrated
with PMML, but it should generalize to other data formats as well).
The export method should take a parameter that gives a "best guess"
description of the associated feature schema:
```
interface PMMLExportable {
    void toPMML(OutputStream outputStream, FeatureSchema featureSchema);
}
```
It would be desirable to know the name, operational type (e.g. continuous,
categorical) and data type (e.g. integer, float, double) of each feature,
because the exporter could then generate more specialized and informative
content.
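One possible shape for such a schema, as a minimal sketch: `Feature`, `OpType`, `DataType` and the body of `FeatureSchema` below are hypothetical and chosen only for illustration; they are not part of Spark or of any PMML library. The `STRING` data type is an added assumption to cover categorical values such as occupation names.

```java
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical types for illustration only; none of them exist in Spark.
enum OpType { CONTINUOUS, CATEGORICAL }

enum DataType { INTEGER, FLOAT, DOUBLE, STRING }

// Describes a single feature: its name, operational type and data type.
final class Feature {
    final String name;
    final OpType opType;
    final DataType dataType;

    Feature(String name, OpType opType, DataType dataType) {
        this.name = name;
        this.opType = opType;
        this.dataType = dataType;
    }
}

// An ordered, read-only list of feature descriptions.
final class FeatureSchema {
    private final List<Feature> features;

    FeatureSchema(List<Feature> features) {
        // Defensive copy so later changes to the caller's list are not visible.
        this.features = Collections.unmodifiableList(new ArrayList<>(features));
    }

    List<Feature> features() {
        return features;
    }
}

// The export interface from the comment, taking the schema as a parameter.
interface PMMLExportable {
    void toPMML(OutputStream outputStream, FeatureSchema featureSchema);
}
```

Keeping the schema a plain value object means a caller can assemble it from whatever metadata is available and pass it in as a "best guess".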
Example one: description of regression terms:
```xml
<!-- Dummy feature schema -->
<NumericPredictor name="x33" coefficient="0.3401013140365875"/>
<NumericPredictor name="x34" coefficient="-1.2512420398555455"/>
<!-- Rich feature schema -->
<CategoricalPredictor name="Occupation" value="Executive" coefficient="0.3401013140365875"/>
<CategoricalPredictor name="Occupation" value="Farming" coefficient="-1.2512420398555455"/>
```
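The difference between the two outputs above could be driven by the schema roughly as follows. This is a sketch under assumptions: `RegressionTerms.term` is a hypothetical helper, not Spark's exporter, and it assumes each model column is either a continuous feature or the indicator column for one category of a categorical feature.

```java
// Hypothetical helper for illustration; not part of Spark.
final class RegressionTerms {
    // categoryValue == null marks a continuous column. Otherwise the column is
    // assumed to be the indicator for that category of the named feature.
    // XML escaping is omitted for brevity; a real exporter would need it.
    static String term(String name, String categoryValue, double coefficient) {
        if (categoryValue == null) {
            return "<NumericPredictor name=\"" + name
                + "\" coefficient=\"" + coefficient + "\"/>";
        }
        return "<CategoricalPredictor name=\"" + name
            + "\" value=\"" + categoryValue
            + "\" coefficient=\"" + coefficient + "\"/>";
    }
}
```

With a dummy schema every column takes the first branch; with a rich schema the same coefficient can be attached to a named category instead.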
Example two: description of tree split conditions:
```xml
<!-- Dummy feature schema -->
<SimplePredicate field="x33" operator="equal" value="1.0"/>
<!-- Rich feature schema -->
<SimplePredicate field="Occupation" operator="equal" value="Executive"/>
```
The current `PMMLExportable` implementation classes are hard-coded to assume a
dummy feature schema (i.e. all features are continuous doubles named
`x1`, `x2`, ..., `x${numFeatures}`).
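For reference, the dummy naming scheme described here amounts to something like the sketch below. `DummySchema` is a hypothetical name and the code is not copied from the Spark source; it only illustrates the `x1` to `x${numFeatures}` convention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the dummy naming convention.
final class DummySchema {
    // Returns the names x1, x2, ..., x<numFeatures> in order.
    static List<String> names(int numFeatures) {
        List<String> names = new ArrayList<>();
        for (int i = 1; i <= numFeatures; i++) {
            names.add("x" + i);
        }
        return names;
    }
}
```

Such a list could still serve as the fallback when the caller has no richer schema to offer.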