[GitHub] spark pull request: [SPARK-3530][MLLIB] pipeline and parameters wi...

jkbradley Tue, 11 Nov 2014 11:04:51 -0800

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-62598509
  
    @etrain Thanks for all the feedback!  I hope you keep pushing for 
simplifications.  A few thoughts:
    
    > Tuning
    
    Looking ahead to more complicated cases like tuning seems really important 
to me.  (I actually worry a bit that we haven't thought quite far enough ahead 
to ensembles, distributed models, streaming, etc.)  I think we'll have to 
accept that this API may require significant discussion and some updates, and I 
hope we can simplify it too.  (On the Evaluator-as-a-Transformer issue, I do 
think it's fundamentally different in terms of type, since it gives a scalar 
value, and in terms of concepts; it deserves its own class since evaluation is 
such a key part of stats/ML.)
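    
    To make the type distinction concrete, here is a minimal sketch. All names 
(`EvaluatorSketch`, the `Dataset` alias, `MeanSquaredError`) are hypothetical 
stand-ins and not the API proposed in this PR; the only point is that a 
Transformer returns a dataset while an Evaluator returns a scalar.

```scala
// Hypothetical, simplified types for illustration only; not the PR's API.
object EvaluatorSketch {
  // Assumed stand-in for a real dataset: (label, prediction) pairs.
  type Dataset = Seq[(Double, Double)]

  // A Transformer maps a dataset to another dataset.
  trait Transformer {
    def transform(data: Dataset): Dataset
  }

  // An Evaluator maps a dataset to a single scalar metric.
  trait Evaluator {
    def evaluate(data: Dataset): Double
  }

  // Mean squared error as one example of an Evaluator.
  object MeanSquaredError extends Evaluator {
    def evaluate(data: Dataset): Double = {
      require(data.nonEmpty, "cannot evaluate an empty dataset")
      val squaredErrors = data.map { case (label, pred) =>
        (label - pred) * (label - pred)
      }
      squaredErrors.sum / data.size
    }
  }
}
```

    Forcing `evaluate` into the `transform` signature would mean wrapping the 
scalar in a one-row dataset, which is the mismatch described above.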
    
    > Even slightly complicated pipelines are tough to write. For example, if I 
have an Estimator that takes an Iterator[DataSet] and produces a Model, it is 
difficult to express in this framework.
    
    What @mengxr mentioned about more complex Pipelines will help.  But this 
also touches on the concept of a stateful Estimator, which would help a lot 
with your pipeline using an Iterator[DataSet].  I strongly believe we'll need 
stateful Estimators in the future: definitely for Streaming, and possibly for 
exposing iterative estimation algorithms.  I have something for the latter 
(iterative algs) in my class hierarchy prototype...as soon as I can finish it.
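    
    As a rough illustration of what "stateful" could mean here (this is not 
the class hierarchy prototype mentioned above, and every name in it is a 
hypothetical stand-in), an Estimator might keep sufficient statistics between 
calls, so that fitting on an Iterator of batches is just repeated updates:

```scala
// Hypothetical sketch of a stateful estimator; not the PR's API.
object StatefulEstimatorSketch {
  // Assumed stand-in for a real dataset: a batch of numeric observations.
  type Dataset = Seq[Double]

  // Trivial model that just remembers the mean of the data seen so far.
  final case class MeanModel(mean: Double)

  final class RunningMeanEstimator {
    // Internal state carried across batches.
    private var count: Long = 0L
    private var sum: Double = 0.0

    // Fold one more batch into the running statistics.
    def update(batch: Dataset): this.type = {
      count += batch.size
      sum += batch.sum
      this
    }

    // Snapshot a model from the current state.
    def currentModel: MeanModel = {
      require(count > 0, "no data seen yet")
      MeanModel(sum / count)
    }

    // Fitting on an iterator of batches is repeated updates plus a snapshot.
    def fit(batches: Iterator[Dataset]): MeanModel = {
      batches.foreach(batch => update(batch))
      currentModel
    }
  }
}
```

    The same `update` method would serve a streaming source, where batches 
arrive over time and `currentModel` is read whenever a model is needed.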
    
    > 7. I still think getting rid of setters and just using constructor 
arguments would simplify things here. As new versions of the same PipelineNode 
with more options get added, we'd need to add additional constructors and 
support calling in the old way for API compatibility - it's not pretty but I 
think it's better than the current proposal.
    
    Along with what @mengxr said, this kind of parameter bloat in constructors 
is also a major problem for ensembles.  Gradient boosting combined with 
decision trees needs far more parameters than the 10-parameter limit from 
scalastyle allows.
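    
    A small sketch of the contrast, assuming made-up parameter names and 
default values (none of them are taken from the PR or from MLlib): the 
constructor version widens with every new option, while the setter version 
keeps defaults in one place and lets callers override only what they need.

```scala
// Hypothetical parameter names and defaults, for illustration only.
object ParamSketch {
  // Constructor style: boosting options plus tree options all land in one
  // signature, and each new option needs another overload for compatibility.
  final class BoostedTreesCtor(
      val numIterations: Int,
      val learningRate: Double,
      val maxDepth: Int,
      val maxBins: Int,
      val minInstancesPerNode: Int)

  // Setter style: each parameter has a default and a chainable setter.
  final class BoostedTrees {
    private var numIterations: Int = 100
    private var learningRate: Double = 0.1
    private var maxDepth: Int = 5

    def setNumIterations(value: Int): this.type = { numIterations = value; this }
    def setLearningRate(value: Double): this.type = { learningRate = value; this }
    def setMaxDepth(value: Int): this.type = { maxDepth = value; this }

    def getNumIterations: Int = numIterations
    def getLearningRate: Double = learningRate
    def getMaxDepth: Int = maxDepth
  }
}
```

    With setters, a call such as 
`new ParamSketch.BoostedTrees().setMaxDepth(3)` keeps working unchanged when 
further parameters are added later.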


