GitHub user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62598509
@etrain Thanks for all the feedback! I hope you keep pushing for
simplifications. A few thoughts:
> Tuning
Looking ahead to more complicated cases like tuning seems really important
to me. (I actually worry a bit that we haven't thought quite far enough ahead
to ensembles, distributed models, streaming, etc.) I think we'll have to
accept that this API may require significant discussion and some updates, and I
hope we can simplify it too. (On the Evaluator-as-a-Transformer issue, I do
think an Evaluator is fundamentally different in terms of type, since it
returns a scalar value, and in terms of concept: evaluation is such a key part
of stats/ML that it deserves its own class.)
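To make the type argument concrete, here is a rough sketch. All names in it (`Row`, `Dataset`, `Threshold`, `Accuracy`) are simplified stand-ins invented for illustration, not the interfaces proposed in this PR. The point is only that a Transformer's output can feed another pipeline stage, while an Evaluator's output cannot.

```scala
// Hypothetical, simplified stand-ins; not the actual API from this PR.
object EvaluatorSketch {
  final case class Row(label: Double, prediction: Double)
  type Dataset = Seq[Row]

  // A Transformer maps a dataset to another dataset, so stages can be chained.
  trait Transformer {
    def transform(data: Dataset): Dataset
  }

  // An Evaluator reduces a dataset to one scalar metric, so nothing chains after it.
  trait Evaluator {
    def evaluate(data: Dataset): Double
  }

  // Example transformer: thresholds raw scores at 0.5.
  object Threshold extends Transformer {
    def transform(data: Dataset): Dataset =
      data.map(r => r.copy(prediction = if (r.prediction >= 0.5) 1.0 else 0.0))
  }

  // Example evaluator: fraction of rows whose prediction equals the label.
  object Accuracy extends Evaluator {
    def evaluate(data: Dataset): Double =
      if (data.isEmpty) 0.0
      else data.count(r => r.prediction == r.label).toDouble / data.size
  }
}
```

Forcing `Accuracy` into the `Transformer` signature would mean wrapping a single number in a dataset, which is the mismatch described above.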
> Even slightly complicated pipelines are tough to write. For example, if I
> have an Estimator that takes an Iterator[DataSet] and produces a Model, it is
> difficult to express in this framework.
What @mengxr mentioned about more complex Pipelines will help. But this
also touches on the concept of a stateful Estimator, which would help a lot
with your pipeline using an Iterator[Dataset]. I strongly believe we'll need
stateful Estimators in the future: definitely for Streaming, and possibly for
exposing iterative estimation algorithms. I have something for the latter
(iterative algorithms) in my class hierarchy prototype, which I'll share as
soon as I can finish it.
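As a minimal sketch of what a stateful Estimator could look like: the `StatefulEstimator` trait, its `update`/`model` methods, and the `RunningMean` example below are all hypothetical names chosen for illustration and are not taken from this PR or from the prototype. The assumption is that the estimator folds each batch into internal state and can produce a model snapshot at any point.

```scala
// Hypothetical sketch; none of these names come from the PR.
object StatefulEstimatorSketch {
  type Dataset = Seq[Double]

  final case class MeanModel(mean: Double, count: Long)

  // update() folds one batch into internal state; model snapshots the current state.
  trait StatefulEstimator[M] {
    def update(batch: Dataset): this.type
    def model: M
  }

  // Toy estimator: keeps a running sum and count across batches.
  final class RunningMean extends StatefulEstimator[MeanModel] {
    private var sum = 0.0
    private var n = 0L

    def update(batch: Dataset): this.type = {
      sum += batch.sum
      n += batch.size
      this
    }

    def model: MeanModel = MeanModel(if (n == 0) 0.0 else sum / n, n)
  }

  // Fitting on an Iterator[Dataset] then reduces to feeding batches one by one.
  def fitAll[M](est: StatefulEstimator[M], batches: Iterator[Dataset]): M = {
    batches.foreach(est.update)
    est.model
  }
}
```

Under this shape, the same `update` call could serve a streaming source (one call per micro-batch) or an iterative algorithm (one call per pass), which is why the two use cases are grouped together here.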
> 7. I still think getting rid of setters and just using constructor
> arguments would simplify things here. As new versions of the same PipelineNode
> with more options get added, we'd need to add additional constructors and
> support calling in the old way for API compatibility - it's not pretty but I
> think it's better than the current proposal.
Along with what @mengxr said, this kind of parameter bloat in constructors
is also a major problem for ensembles. Gradient boosting combined with decision
trees takes far more parameters than the 10-parameter limit enforced by
scalastyle.
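A small sketch of the setter style applied to an ensemble, to show how the parameter count stays manageable. The class names, parameter names, and default values below are invented for illustration and should not be read as the actual parameters of any Spark algorithm.

```scala
// Hypothetical parameter names and defaults, for illustration only.
object ParamSketch {
  final class DecisionTreeParams {
    private var _maxDepth = 5
    private var _maxBins = 32
    def setMaxDepth(v: Int): this.type = { _maxDepth = v; this }
    def setMaxBins(v: Int): this.type = { _maxBins = v; this }
    def maxDepth: Int = _maxDepth
    def maxBins: Int = _maxBins
  }

  final class GradientBoostedTrees {
    private var _numIterations = 100
    private var _learningRate = 0.1
    // The base learner carries its own parameters instead of widening this constructor.
    val tree = new DecisionTreeParams
    def setNumIterations(v: Int): this.type = { _numIterations = v; this }
    def setLearningRate(v: Double): this.type = { _learningRate = v; this }
    def numIterations: Int = _numIterations
    def learningRate: Double = _learningRate
  }
}
```

A caller overrides only what it needs, for example `new GradientBoostedTrees().setNumIterations(50)` followed by `gbt.tree.setMaxDepth(3)`, and adding a parameter later means adding one setter rather than another constructor overload.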