Github user hirakendu commented on the pull request:
https://github.com/apache/spark/pull/79#issuecomment-39385736
Thanks Manish, Evan, Ameet, Ram, and thanks Xiangrui, Matei. It's great to
have this PR merged, but I think there is room for a lot of improvement, and I
hope this thread is still open for comments. I have some detailed notes from my
side; we can track them in a separate place if required.
Some design notes to start with:
The current design of classes and relationships is good,
but I think it would be great if it can be modified slightly to
make it similar to the design of other existing algorithms in MLLib
and extend the existing interfaces.
For example, we can have a `DecisionTreeModel` that extends
the existing `RegressionModel` interface, similar to
the existing `RidgeRegressionModel` and `LinearRegressionModel`.
Alternatively, we can have separate `ClassificationTreeModel`
and `RegressionTreeModel` that extend the existing interfaces
`ClassificationModel` and `RegressionModel` respectively.
Note that it is important to keep the `Model` as a class,
so that we can later compose ensemble and other composite models
consisting of multiple `Model` instances.
Currently the decision tree `Node` is essentially the `Model`,
although I would prefer a wrapper around it, explicitly
called `DecisionTreeModel`, that implements methods like
`predict`, `save`, `explain` and `load`, as I remember
coming across in MLI.
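To make the suggestion concrete, here is a rough sketch of such a wrapper. Everything except the `RegressionModel` name is hypothetical (the `TreeNode` fields and the wrapper's shape are my own illustration, not the PR's actual API), and `save`/`explain`/`load` are omitted:

```scala
// Existing MLLib-style interface (simplified stand-in).
trait RegressionModel extends Serializable {
  def predict(testData: Array[Double]): Double
}

// Stand-in for the tree's internal Node: leaves carry a prediction,
// internal nodes carry a split on (feature, threshold).
case class TreeNode(
    feature: Int,
    threshold: Double,
    prediction: Double,
    left: Option[TreeNode],
    right: Option[TreeNode]) {
  def isLeaf: Boolean = left.isEmpty && right.isEmpty
}

// The proposed wrapper: holds the root Node and implements the Model
// interface; save/explain/load would live here as well.
class DecisionTreeModel(val root: TreeNode) extends RegressionModel {
  override def predict(testData: Array[Double]): Double = {
    var node = root
    while (!node.isLeaf) {
      node =
        if (testData(node.feature) <= node.threshold) node.left.get
        else node.right.get
    }
    node.prediction
  }
}
```

The point is that ensembles can then hold a collection of `RegressionModel` instances without caring that each one is backed by a tree.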
In the file `DecisionTree.scala`, the actual class
can be renamed to `DecisionTreeAlgorithm`, extending
an `Algorithm` interface that implements `train(samples)`
and outputs a `RegressionModel`.
Note that there is no `Algorithm` interface currently in MLLib,
although there may be one in MLI; the closest is
`abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]`.
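Since no such interface exists yet, here is the kind of contract I have in mind; all names here are illustrative, and the trivial mean-predictor exists only to show the shape of `train(samples)`:

```scala
// Hypothetical Algorithm interface: train on (label, features) samples
// and produce a model of type M.
trait Algorithm[M] {
  def train(samples: Seq[(Double, Array[Double])]): M
}

// A toy Algorithm whose "model" is just a function predicting the
// mean label, to illustrate the contract.
class MeanAlgorithm extends Algorithm[Array[Double] => Double] {
  def train(samples: Seq[(Double, Array[Double])]): Array[Double] => Double = {
    val mean = samples.map(_._1).sum / samples.size
    _ => mean
  }
}
```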
Modeling `Strategy` should be renamed to `Parameters`.
I would prefer a separation between model parameters,
like depth and minimum gain, and algorithm parameters,
like level-by-level training or caching.
The line is blurred for some aspects like quantile strategy,
but I am inclined to put those into model parameters,
so they can be referred to later on.
The rule of thumb is that different algorithm parameters
should lead to the same model (up to randomization effects)
for the same model parameters.
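A sketch of what the split could look like (the class and field names here are my own, not the current `Strategy` API):

```scala
// Model parameters determine *what* model is learned.
case class ModelParameters(
    maxDepth: Int,
    minGain: Double,
    quantileStrategy: String) // blurred case, filed under model parameters

// Algorithm parameters only affect *how* the model is computed;
// varying them should yield the same model for fixed ModelParameters.
case class AlgorithmParameters(
    levelByLevel: Boolean,
    cacheInput: Boolean)
```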
`Impurity` should be renamed to `Error` or something more technical.
Also see my later comments on the need for a generic `Error` interface
that allows easy adaptation of algorithms to specific loss functions.
In general, MLLib and MLI should define proper interfaces
for `Model`, `Algorithm`, `Parameters` and importantly `Loss` entities,
at least for supervised machine learning algorithms,
and all implementations should adhere to them.
Surprisingly, neither MLLib nor MLI currently has a `Loss` or `Error`
interface; the closest is the `Optimizer` interface.
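The kind of generic `Loss` interface I am suggesting could look like the sketch below; the trait shape and the `SquaredLoss` instance are hypothetical, just to show how an algorithm could be parameterized by a loss:

```scala
// Hypothetical generic Loss interface.
trait Loss extends Serializable {
  // value of the loss for one (prediction, label) pair
  def loss(prediction: Double, label: Double): Double
  // derivative of the loss with respect to the prediction,
  // needed by gradient-based algorithms
  def gradient(prediction: Double, label: Double): Double
}

// The usual squared-error loss as one concrete instance.
object SquaredLoss extends Loss {
  def loss(p: Double, y: Double): Double = { val d = p - y; 0.5 * d * d }
  def gradient(p: Double, y: Double): Double = p - y
}
```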
There is also a need for portable model output formats, e.g., JSON,
that can be used by other programs, possibly written in languages
other than Scala and Java.
We could also use an existing format like PMML (though I'm not sure how widely it's used).
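As a toy example of how simple a portable JSON encoding could be, here is a hand-rolled serializer for a tree; the `Tree` case class and the field names are made up for illustration, not a proposed standard:

```scala
// Minimal tree structure for the example: leaves hold a prediction,
// internal nodes hold a split.
case class Tree(feature: Int, threshold: Double, prediction: Double,
                left: Option[Tree], right: Option[Tree])

// Recursive JSON encoding, written by hand to avoid any library dependence.
def toJson(t: Tree): String = (t.left, t.right) match {
  case (Some(l), Some(r)) =>
    s"""{"feature":${t.feature},"threshold":${t.threshold},""" +
      s""""left":${toJson(l)},"right":${toJson(r)}}"""
  case _ =>
    s"""{"predict":${t.prediction}}"""
}
```

Any language with a JSON parser could then load and apply such a model.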
Lastly, there is a need to support standard data formats with
optional delimiter parameters; I am sure this is a general need for Spark.
I understand there has been significant effort toward standardization before;
it would be good to know the current status.
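By "optional delimiter parameters" I mean something as simple as the sketch below; the function name and the (label, features) line layout are my assumptions:

```scala
// Parse one delimited text line into (label, features).
// The delimiter is a regex, as with String.split, and defaults to a comma.
def parseLabeledLine(
    line: String,
    delimiter: String = ","): (Double, Array[Double]) = {
  val fields = line.split(delimiter).map(_.trim.toDouble)
  (fields.head, fields.tail)
}
```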