Github user hirakendu commented on the pull request:

    https://github.com/apache/spark/pull/79#issuecomment-39385736
  
    Thanks Manish, Evan, Ameet, Ram, and thanks Xiangrui, Matei. It's great to 
have this PR merged, but I think there is room for a lot of improvement, and I 
hope this thread is still open for comments. I have some detailed notes from my 
side; we can track them in a separate place if required.
    
    Some design notes to start with:
    
    The current design of classes and relationships is good,
    but I think it would be great if it could be modified slightly to
    make it similar to the design of other existing algorithms in MLLib
    and extend the existing interfaces.
    For example, we can have a `DecisionTreeModel` that extends
    the existing `RegressionModel` interface, similar to
    the existing `RidgeRegressionModel` and `LinearRegressionModel`.
    Alternatively, we can have separate `ClassificationTreeModel`
    and `RegressionTreeModel` that extend the existing interfaces
    `ClassificationModel` and `RegressionModel` respectively.
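
    To make the alternative concrete, here is a hedged sketch of what the
    second option could look like. The traits below are simplified single-point
    stand-ins for MLlib's `ClassificationModel` and `RegressionModel` (the
    real ones also predict over RDDs), and the tree classes are illustrative,
    not the PR's actual ones.

```scala
// Simplified stand-ins for the existing MLlib interfaces (assumption:
// real traits also take RDDs; we keep only the single-point method here).
trait RegressionModel { def predict(features: Array[Double]): Double }
trait ClassificationModel { def predict(features: Array[Double]): Double }

// Illustrative tree representation, not the PR's actual Node class.
sealed trait TreeNode
case class TreeLeaf(value: Double) extends TreeNode
case class TreeSplit(feature: Int, threshold: Double,
                     left: TreeNode, right: TreeNode) extends TreeNode

object TreeOps {
  // Route a feature vector down the tree to a leaf value.
  def walk(node: TreeNode, features: Array[Double]): Double = node match {
    case TreeLeaf(v) => v
    case TreeSplit(f, t, l, r) =>
      if (features(f) <= t) walk(l, features) else walk(r, features)
  }
}

class RegressionTreeModel(root: TreeNode) extends RegressionModel {
  def predict(features: Array[Double]): Double = TreeOps.walk(root, features)
}

class ClassificationTreeModel(root: TreeNode) extends ClassificationModel {
  // Threshold the leaf score at 0.5 for a binary label.
  def predict(features: Array[Double]): Double =
    if (TreeOps.walk(root, features) >= 0.5) 1.0 else 0.0
}
```

    The point is that anything consuming a `RegressionModel` today would then
    accept a tree model without changes.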
    
    Note that it is important to keep the `Model` as a class,
    so that we can later compose ensemble and other composite models
    consisting of multiple `Model` instances.
    Currently the decision tree `Node` is essentially the `Model`,
    although I would prefer a wrapper around it, explicitly
    called `DecisionTreeModel`, that implements methods like
    `predict`, `save`, `explain` and `load`, like I remember
    coming across in MLI.
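
    A hedged sketch of such a wrapper follows; the `Node` classes and method
    names are illustrative assumptions, and `save`/`load` are left as
    commented-out signatures since MLlib has no model serialization format yet.

```scala
// Illustrative tree node hierarchy, not the PR's actual Node class.
sealed trait Node
case class Leaf(prediction: Double) extends Node
case class Internal(feature: Int, threshold: Double,
                    left: Node, right: Node) extends Node

// Explicit wrapper model around the root Node, as suggested above.
class DecisionTreeModel(val root: Node) {
  private def eval(node: Node, features: Array[Double]): Double = node match {
    case Leaf(p) => p
    case Internal(f, t, l, r) =>
      eval(if (features(f) <= t) l else r, features)
  }

  def predict(features: Array[Double]): Double = eval(root, features)

  // Human-readable dump of the tree, one line per node.
  def explain: String = {
    def go(node: Node, indent: String): String = node match {
      case Leaf(p) => s"${indent}predict ${p}\n"
      case Internal(f, t, l, r) =>
        s"${indent}if (feature(${f}) <= ${t})\n" +
          go(l, indent + "  ") + go(r, indent + "  ")
    }
    go(root, "")
  }

  // def save(path: String): Unit = ???                // format to be defined
  // def load(path: String): DecisionTreeModel = ???   // ditto
}
```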
    
    In the file `DecisionTree.scala`, the actual class
    can be renamed to `DecisionTreeAlgorithm`, extending an
    `Algorithm` interface that implements `train(samples)`
    and outputs a `RegressionModel`.
    Note that there is no `Algorithm` interface currently in MLLib,
    although there may be one in MLI and the closest is
    `abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]`.
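
    Such an `Algorithm` interface could look like the hedged sketch below.
    `LabeledPoint` and the trivial mean-predictor are simplified stand-ins
    (real MLlib code would train over an RDD); a `DecisionTreeAlgorithm`
    would grow a tree in `train` instead of computing a constant.

```scala
// Simplified stand-in for MLlib's labeled training example.
case class LabeledPoint(label: Double, features: Array[Double])

// The proposed missing interface: an algorithm trains a model of type M.
trait Algorithm[M] {
  def train(samples: Seq[LabeledPoint]): M
}

// Trivial example implementation: predicts the mean label regardless of
// features, just to show how the pieces compose.
class ConstantModel(val prediction: Double) {
  def predict(features: Array[Double]): Double = prediction
}

object MeanAlgorithm extends Algorithm[ConstantModel] {
  def train(samples: Seq[LabeledPoint]): ConstantModel =
    new ConstantModel(samples.map(_.label).sum / samples.size)
}
```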
    
    The modeling `Strategy` should be renamed to `Parameters`.
    I would prefer a separation between model parameters
    like depth and minimum gain, and algorithm parameters
    like level-by-level training or caching.
    The line is blurred for some aspects like quantile strategy,
    but I am inclined to put those into modeling parameters,
    so they can be referred to later on.
    The rule of thumb is that different algorithm parameters
    should lead to the same model (up to randomization effects)
    for the same model parameters.
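
    A hedged sketch of the split, with all field names being assumptions
    rather than the PR's actual parameter names:

```scala
// Parameters that determine the learned model itself.
case class ModelParameters(
  maxDepth: Int = 5,
  minGain: Double = 0.0,
  // The blurred case discussed above: kept on the model side so the
  // choice is recorded alongside the model it produced.
  quantileStrategy: String = "sort"
)

// Parameters that only affect how the model is computed; by the rule of
// thumb, varying these should yield the same model (up to randomization).
case class AlgorithmParameters(
  levelByLevel: Boolean = true,  // training schedule, not model shape
  cacheNodeIds: Boolean = false  // caching choice, no effect on the model
)
```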
    
    `Impurity` should be renamed to `Error` or something more technical.
    Also see my later comments on the need for a generic `Error` interface
    that allows easy adaptation of algorithms to specific loss functions.
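
    A hedged sketch of what such a generic interface could cover, spanning
    both a classification impurity and a regression loss; the names and
    signatures are assumptions for illustration.

```scala
// Generic error of a set of labels, relative to the best single
// prediction for them (mean for regression, majority for classification).
trait Error {
  def calculate(labels: Seq[Double]): Double
}

object SquaredError extends Error {
  // Sum of squared deviations from the mean (variance-style impurity).
  def calculate(labels: Seq[Double]): Double = {
    val mean = labels.sum / labels.size
    labels.map(l => (l - mean) * (l - mean)).sum
  }
}

object GiniError extends Error {
  // Gini impurity 1 - sum_k p_k^2, here for binary 0/1 labels.
  def calculate(labels: Seq[Double]): Double = {
    val p = labels.sum / labels.size
    1.0 - p * p - (1.0 - p) * (1.0 - p)
  }
}
```

    A tree algorithm parameterized by `Error` could then be specialized to a
    new loss by supplying one small object instead of touching the algorithm.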
    
    In general, MLLib and MLI should define proper interfaces
    for `Model`, `Algorithm`, `Parameters` and importantly `Loss` entities,
    at least for supervised machine learning algorithms
    and all implementations should adhere to it.
    Surprisingly, neither MLlib nor MLI currently has a `Loss` or `Error` 
interface;
    the closest is the `Optimizer` interface.
    There is also a need for portable model output formats, e.g., JSON,
    that can be used by other programs, possibly written in other languages
    outside Scala and Java.
    We could also use an existing format like PMML (though I am not sure
    how widely it is used).
    Lastly, there is a need to support standard data formats with
    optional delimiter parameters - I am sure this is a general need for Spark.
    I understand there has been significant effort on standardization before;
    it would be good to know the current status.
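
    On portable model output, a hedged sketch of a JSON serialization for a
    tree model, consumable from other languages; the schema and class names
    are invented for illustration, not a proposed standard.

```scala
// Illustrative tree representation for serialization.
sealed trait JsonNode
case class JsonLeaf(predict: Double) extends JsonNode
case class JsonSplit(feature: Int, threshold: Double,
                     left: JsonNode, right: JsonNode) extends JsonNode

object JsonExport {
  // Recursively emit a nested JSON object; leaves carry only a prediction,
  // internal nodes carry the split plus their two subtrees.
  def toJson(node: JsonNode): String = node match {
    case JsonLeaf(p) => s"""{"predict":$p}"""
    case JsonSplit(f, t, l, r) =>
      s"""{"feature":$f,"threshold":$t,""" +
        s""""left":${toJson(l)},"right":${toJson(r)}}"""
  }
}
```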

