Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62207734
I'd like to second the thanks to you both for trying out the new API! Some
thoughts:
@shivaram About your Dataset API comments:
As @mengxr said, I am planning several abstractions which should help with
the boilerplate. I agree with your proposal to stick with the familiar
RDD[MyType] API where possible and let abstractions handle the boilerplate
of working with SchemaRDD. Where that is not possible, I still hope to provide
some helper functions to reduce boilerplate.
I have this set of classes partly sketched out and will send a WIP PR once
this PR gets merged.
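To make the idea concrete, here is a minimal sketch of what such an abstraction could look like. It does not use Spark: `Row` is a plain `Map` standing in for a SchemaRDD row, `Seq` stands in for `RDD`, and `RowCodec`, `toRows`, and `fromRows` are hypothetical names invented for this illustration, not classes from this PR.

```scala
// Hypothetical sketch, not Spark code.
object TypedDatasetSketch {
  // Stand-in for a schema-based row: column name -> untyped value.
  type Row = Map[String, Any]

  // Hypothetical abstraction that owns the row conversion boilerplate.
  trait RowCodec[T] {
    def toRow(value: T): Row
    def fromRow(row: Row): T
  }

  // The user's own record type, playing the role of MyType in RDD[MyType].
  final case class LabeledPoint(label: Double, features: Vector[Double])

  object LabeledPoint {
    // Written once per record type; user code never touches rows directly.
    implicit val codec: RowCodec[LabeledPoint] = new RowCodec[LabeledPoint] {
      def toRow(p: LabeledPoint): Row =
        Map("label" -> p.label, "features" -> p.features)
      def fromRow(row: Row): LabeledPoint =
        LabeledPoint(
          row("label").asInstanceOf[Double],
          row("features").asInstanceOf[Vector[Double]])
    }
  }

  // Helper functions: callers pass and receive typed collections.
  def toRows[T](data: Seq[T])(implicit codec: RowCodec[T]): Seq[Row] =
    data.map(codec.toRow)

  def fromRows[T](rows: Seq[Row])(implicit codec: RowCodec[T]): Seq[T] =
    rows.map(codec.fromRow)
}
```

With something along these lines, user code stays in terms of `Seq[LabeledPoint]` (or `RDD[LabeledPoint]` in a real implementation), and the column names and casts live in one place.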
@shivaram About your Pipelines API comments:
* Loops in a pipeline: What @mengxr suggested might work for the FFT thing,
but general Pipelines with cycles, etc. are definitely future work.
* Parameters vs. Constructors: Instinctively, I agree about having at least
some parameters specified in a constructor, especially when they are required
parameters (e.g., the Estimator for CrossValidation). However, @mengxr
convinced me that it makes things difficult. E.g., for CrossValidation, you
really don't want a CV instance to be tied to a particular estimator since you
may want to run CV to choose between several Estimators.
* Chaining evaluators to a Pipeline: Initially, the two ways to get
evaluations will be to look at Transformers created by fitting Estimators (to
see training evaluation metrics) and to compute metrics on your own using the
new columns in the SchemaRDD produced by transform (to get test metrics).
Later on, it would be great to allow users to insert Evaluators into Pipelines
so they can compute custom metrics more easily.
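The parameters-vs-constructors point can be illustrated with a small sketch. Everything here is a hypothetical stand-in written for this example (`Estimator`, `Model`, `CrossValidator`, `chooseBest`, and the two toy estimators are not the classes from this PR), and plain `Seq` replaces `RDD`. The only point is that the estimator is a settable parameter, so one CV instance can be reused across candidates.

```scala
// Hypothetical sketch, not the API from this PR.
object CrossValidatorSketch {
  type Data = Seq[(Double, Double)] // (feature x, label y)

  trait Model { def predict(x: Double): Double }

  trait Estimator {
    def name: String
    def fit(data: Data): Model
  }

  // Toy estimator: always predicts the mean training label.
  object MeanEstimator extends Estimator {
    val name = "mean"
    def fit(data: Data): Model = {
      val mean = data.map(_._2).sum / data.size
      new Model { def predict(x: Double): Double = mean }
    }
  }

  // Toy estimator: least-squares fit of y = w * x through the origin.
  object SlopeEstimator extends Estimator {
    val name = "slope"
    def fit(data: Data): Model = {
      val w = data.map { case (x, y) => x * y }.sum /
        data.map { case (x, _) => x * x }.sum
      new Model { def predict(x: Double): Double = w * x }
    }
  }

  // The estimator is a parameter, not a constructor argument.
  class CrossValidator {
    private var estimator: Option[Estimator] = None
    private var numFolds: Int = 3

    def setEstimator(e: Estimator): this.type = { estimator = Some(e); this }
    def setNumFolds(k: Int): this.type = { require(k >= 2); numFolds = k; this }

    // Mean squared error on held-out folds, averaged over all folds.
    def evaluate(data: Data): Double = {
      val est = estimator.getOrElse(
        throw new IllegalStateException("estimator not set"))
      val indexed = data.zipWithIndex
      val foldErrors = (0 until numFolds).map { fold =>
        val (test, train) = indexed.partition { case (_, i) => i % numFolds == fold }
        val model = est.fit(train.map(_._1))
        test.map { case ((x, y), _) => math.pow(model.predict(x) - y, 2) }.sum / test.size
      }
      foldErrors.sum / numFolds
    }
  }

  // Reuse one CrossValidator to choose between several Estimators.
  def chooseBest(cv: CrossValidator, candidates: Seq[Estimator], data: Data): Estimator =
    candidates.minBy(e => cv.setEstimator(e).evaluate(data))
}
```

Had the estimator been a constructor argument, `chooseBest` would need to build a new `CrossValidator` (and re-specify every other setting such as the fold count) for each candidate.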
@tomerk About a few of your comments:
* "There are a lot of parameter traits": I too am ambivalent here. It may
save a little code duplication, but may also discourage people from writing
customized documentation.
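A small sketch of that trade-off, using hypothetical names (`Param`, `HasMaxIter`, `GenericSolver`, `LbfgsSolver` are invented here and only loosely modeled on the idea of shared parameter traits):

```scala
// Hypothetical sketch of a shared parameter trait.
object ParamTraitSketch {
  final case class Param[T](name: String, doc: String, default: T)

  // Shared trait: defined once, so every class mixing it in
  // inherits the same generic doc string and default.
  trait HasMaxIter {
    val maxIter: Param[Int] =
      Param("maxIter", "maximum number of iterations", 100)
  }

  // Saves duplication, but the documentation says nothing algorithm-specific.
  class GenericSolver extends HasMaxIter

  // Customized documentation is still possible, but only if the author
  // remembers to override what the trait already provides.
  class LbfgsSolver extends HasMaxIter {
    override val maxIter: Param[Int] =
      Param("maxIter", "maximum number of L-BFGS iterations", 50)
  }
}
```

The trait removes a few lines per class, while the override shows the extra step an author has to take to document the parameter for a specific algorithm.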