[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

jkbradley Fri, 07 Nov 2014 12:35:17 -0800

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-62207734
  
    I'd like to second the thanks to you both for trying out the new API!  Some 
thoughts:
    
    @shivaram About your Dataset API comments:
    As @mengxr said, I am planning several abstractions which should help with 
the boilerplate.  I agree with your proposal for sticking with the familiar 
RDD[MyType] API where possible and letting abstractions handle the boilerplate 
of working with SchemaRDD.  When that is not possible, I still hope to provide 
some helper functions to reduce boilerplate.
    I have this set of classes partly sketched out and will send a WIP PR once 
this PR gets merged.
    
    @shivaram About your Pipelines API comments:
    * Loops in a pipeline: What @mengxr suggested might work for the FFT case, 
but general Pipelines with cycles, etc. are definitely future work.
    * Parameters vs. Constructors: Instinctively, I agree about having at least 
some parameters specified in a constructor, especially when they are required 
parameters (e.g., the Estimator for CrossValidation).  However, @mengxr 
convinced me that doing so makes things difficult.  E.g., for CrossValidation, 
you really don't want a CV instance to be tied to a particular estimator, since 
you may want to run CV to choose among several Estimators.
    * Chaining evaluators to a Pipeline: Initially, the two ways to get 
evaluations will be to look at Transformers created by fitting Estimators (to 
see training evaluation metrics) and to compute metrics on your own using the 
new columns in the SchemaRDD produced by transform (to get test metrics).  
Later on, it would be great to allow users to insert Evaluators into Pipelines, 
to compute custom metrics more easily.
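
    To make the Parameters vs. Constructors point concrete, here is a minimal 
sketch in plain Python.  Every class and name in it (MeanEstimator, 
MedianEstimator, CrossValidator, the "estimator" key) is a hypothetical 
stand-in invented for illustration, not the API in this PR.  It only shows the 
idea that one CV instance can score several Estimators when the estimator is 
passed as a parameter at call time instead of being fixed in the constructor.

```python
# Toy illustration only: hypothetical stand-ins, not the API from the PR.

class MeanEstimator:
    """Fits a constant model that always predicts the training mean."""
    def fit(self, labels):
        mean = sum(labels) / len(labels)
        return lambda: mean


class MedianEstimator:
    """Fits a constant model that always predicts the (upper) training median."""
    def fit(self, labels):
        ordered = sorted(labels)
        median = ordered[len(ordered) // 2]
        return lambda: median


class CrossValidator:
    """Holds only CV settings; the estimator arrives as a parameter."""
    def __init__(self, num_folds=3):
        self.num_folds = num_folds

    def evaluate(self, labels, params):
        estimator = params["estimator"]  # supplied per call, not per instance
        k = self.num_folds
        total_sq_err = 0.0
        for fold in range(k):
            # Every k-th label (offset by `fold`) is held out for testing.
            test = labels[fold::k]
            train = [y for i, y in enumerate(labels) if i % k != fold]
            predict = estimator.fit(train)
            total_sq_err += sum((predict() - y) ** 2 for y in test)
        return total_sq_err / len(labels)  # mean squared error over all folds


labels = [1.0, 1.0, 1.0, 1.0, 1.0, 100.0]
cv = CrossValidator(num_folds=3)  # one instance, reused for every candidate
candidates = {"mean": MeanEstimator(), "median": MedianEstimator()}
scores = {name: cv.evaluate(labels, {"estimator": est})
          for name, est in candidates.items()}
best = min(scores, key=scores.get)  # candidate with the lowest CV error
```

    If the estimator were a constructor argument instead, comparing the two 
candidates would need two separately configured CV objects.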
    
    @tomerk About a few of your comments:
    * "There are a lot of parameter traits": I too am ambivalent here.  The 
traits may save a little code duplication, but they may also discourage people 
from writing customized documentation.
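
    The trade-off with shared parameter traits can be sketched as follows, 
again in plain Python with made-up names (HasMaxIter, HasRegParam, param_docs, 
explain_params, and the two Toy classes are all assumptions for illustration, 
not code from this PR).  A shared trait gives every algorithm the same generic 
description for free, and a tailored description only appears if the author 
bothers to override it.

```python
# Toy illustration only: hypothetical names, not the traits from the PR.

class HasMaxIter:
    """Shared trait: declares the parameter and its generic doc once."""
    param_docs = {"maxIter": "maximum number of iterations"}


class HasRegParam:
    """Shared trait for a regularization parameter."""
    param_docs = {"regParam": "regularization parameter"}


def explain_params(cls):
    """Merge param docs along the MRO; more specific classes win."""
    docs = {}
    for klass in reversed(cls.__mro__):
        docs.update(vars(klass).get("param_docs", {}))
    return docs


class ToyLogisticRegression(HasMaxIter, HasRegParam):
    """Mixes in both traits and inherits the generic docs unchanged."""


class ToyKMeans(HasMaxIter):
    """Overrides the shared doc with an algorithm-specific description."""
    param_docs = {"maxIter": "maximum number of Lloyd iterations"}


generic_docs = explain_params(ToyLogisticRegression)
custom_docs = explain_params(ToyKMeans)
```

    ToyLogisticRegression needs no parameter code of its own, which is the 
duplication saved; ToyKMeans shows the extra step required to get documentation 
that is specific to the algorithm.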



