Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62199983
@tomerk
> At @shivaram's suggestion, I started porting over a simple text
> classifier pipeline that was already using an Estimator/Transformer
> abstraction of RDD[U] to RDD[V] transforms to this interface. The
> almost-complete port (the imports got messed up when moving files around)
> can be found at shivaram/spark-ml@522aec7.
Thanks for testing on the newsgroup20 pipeline!
> The trickiest bit by far was all of the implicit conversions. I ended up
> needing to use several types of implicit conversion imports (case class ->
> SchemaRDD, Spark SQL DSL, parameter map, etc.). They also got mysteriously
> deleted by the IDE as I moved files between projects. I ended up having to
> copy and paste these whenever appropriate because I couldn't keep track of
> them.
I removed `implicitMapping` because I found that `map(inputCols)` is
actually shorter than `(inputCols: String)`. As for the deleted imports, I
don't think we can trust a Scala IDE to move code around correctly.
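
To make the comparison concrete, here is a standalone sketch of the two
access styles (the `Param`/`ParamMap` below are simplified stand-ins for
illustration, not the classes in this PR):

```scala
import scala.language.implicitConversions

// Simplified stand-ins for Param/ParamMap, for illustration only.
case class Param[T](name: String)

class ParamMap(values: Map[String, Any]) {
  def apply[T](param: Param[T]): T = values(param.name).asInstanceOf[T]
}

object ParamAccessExample extends App {
  val inputCols = Param[String]("inputCols")
  val map = new ParamMap(Map("inputCols" -> "features"))

  // Explicit lookup, as the code does now.
  val explicit: String = map(inputCols)

  // The removed implicitMapping style: with an implicit ParamMap and an
  // implicit view in scope, one would write (inputCols: String) instead.
  implicit val currentMap: ParamMap = map
  implicit def paramToValue[T](p: Param[T])(implicit m: ParamMap): T = m(p)
  val viaImplicit: String = (inputCols: String)

  println(explicit + " == " + viaImplicit)
}
```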
> Like Shivaram, I'm also not familiar with the Spark SQL DSL, so here I
> also had to copy and paste code. It's unclear what syntax is valid and
> what isn't. For example, is saying "as outputCol" enough, or is
> "as Symbol(outputCol)" required?
`as(String)` was recently added to Spark master. If you are not on the
latest version, you need `as Symbol(outputCol)`.
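
The difference is only in whether the column alias must be wrapped in a
`Symbol`. A toy sketch of the two call shapes (this is not the real Spark
SQL DSL, just the shape of the overloads):

```scala
// A toy Column with `as` alias methods; mirrors the call shapes only.
case class Column(expr: String, alias: Option[String] = None) {
  def as(name: Symbol): Column = copy(alias = Some(name.name))
  // Stands in for the as(String) overload recently added on master.
  def as(name: String): Column = copy(alias = Some(name))
}

object AliasExample extends App {
  val outputCol = "prediction"
  val col = Column("score")
  println(col as Symbol(outputCol)) // needed on released versions
  println(col as outputCol)         // only works with the new as(String)
}
```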
> There is a lot of boilerplate code. It was easier to write the
> Transformers in the form RDD[U] to RDD[V] instead of SchemaRDD to
> SchemaRDD, so I fully agree with Shivaram on that front. Potentially,
> certain interfaces along those lines (iterator to iterator transformers
> that can be applied to RDDs using mapPartitions) could make it easier to
> have transformers not depend on local SparkContexts to execute.
We will keep that option. Note that the strongly-typed approach runs into
trouble when we need to deal with extra columns, e.g., event ids, or add
more features later, e.g., weighted instances.
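
Roughly, the two shapes under discussion look like this (the names are
illustrative, and `SchemaRDD` is assumed to be the 1.1-era class):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SchemaRDD

// Strongly typed transformers are concise for a fixed record type...
trait TypedTransformer[U, V] {
  def transform(data: RDD[U]): RDD[V]
}

// ...but one extra field (an event id, an instance weight) changes U and V,
// so every downstream stage has to be re-parameterized or wrapped.
case class Instance(features: Seq[Double])
case class WeightedInstance(features: Seq[Double], weight: Double)

// Column-based transformers let extra columns ride along untouched; only the
// columns a stage actually reads or writes are part of its contract.
trait ColumnTransformer {
  def transform(dataset: SchemaRDD): SchemaRDD
}
```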
> I found the parameter mapping in estimators fairly verbose; I like
> Shivaram's idea of having the estimators pass everything to the
> transformers no matter what.
Yes, I like that idea too.
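
Something along these lines, with simplified stand-in types (not the PR's
actual classes): the estimator hands its whole parameter map to the
transformer it produces instead of forwarding parameters one by one.

```scala
// Simplified stand-ins, not the PR's Estimator/Transformer classes.
class SimpleParamMap(val values: Map[String, Any]) {
  def ++(other: SimpleParamMap): SimpleParamMap =
    new SimpleParamMap(values ++ other.values)
}

abstract class SimpleTransformer(val params: SimpleParamMap)

abstract class SimpleEstimator(val params: SimpleParamMap) {
  protected def train(data: Seq[String], all: SimpleParamMap): SimpleTransformer

  // "Pass everything": the fitted transformer sees every estimator parameter
  // plus any fit-time overrides, with no per-parameter forwarding code.
  def fit(data: Seq[String], overrides: SimpleParamMap): SimpleTransformer =
    train(data, params ++ overrides)
}
```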
> Estimators requiring the transformers they output to extend Model didn't
> make sense to me. Certain estimators, such as one that chooses only the
> most frequent tokens in a collection to keep for each document, don't seem
> like they should output models. On that front, should it be required for
> estimators to specify the type of transformer they output? It can be
> convenient sometimes to just inline an anonymous Transformer as the output
> without making it a top-level class.
The set of most frequent tokens could be viewed as a descriptive model in
your case. I did try estimators without generic model types, but I don't
remember what went wrong that made me switch to generics.
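
For reference, the generic version looks roughly like this (illustrative
names only, not the PR's exact classes):

```scala
// Illustrative shapes only; names do not match the PR exactly.
abstract class Transformer
abstract class Model extends Transformer

abstract class Estimator[M <: Model] {
  def fit(data: Seq[Double]): M
}

class CentroidModel(val centroid: Double) extends Model

class CentroidEstimator extends Estimator[CentroidModel] {
  def fit(data: Seq[Double]): CentroidModel =
    new CentroidModel(data.sum / data.size)
}

// With the type parameter, callers get the concrete model without casting:
//   val m: CentroidModel = new CentroidEstimator().fit(data); m.centroid
// A non-generic Estimator returning a plain Transformer would force a cast,
// which is one reason to keep the generic type.
```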
> There are a lot of parameter traits: HasRegParam, HasMaxIter, HasScoreCol,
> HasFeatureCol, .... Does it make sense to have this many specific
> parameter traits if we still have to maintain boilerplate setter code for
> Java anyway?
I have no preference on this one. I added those before I realized the
problem with Java. I marked them as `private[ml]` in the current version,
and I'm okay with removing them completely.
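
For readers outside this thread, a rough sketch of what such a shared
parameter trait looks like and where the Java-facing boilerplate comes in;
the code below is an assumption about the shape, not the PR's actual
implementation:

```scala
// Rough sketch of a shared-parameter trait; not the PR's actual code.
trait Params {
  protected var paramValues: Map[String, Any] = Map.empty
  protected def set(name: String, value: Any): this.type = {
    paramValues += (name -> value)
    this
  }
}

trait HasMaxIter extends Params {
  def getMaxIter: Int = paramValues.getOrElse("maxIter", 100).asInstanceOf[Int]
  def setMaxIter(value: Int): this.type = set("maxIter", value)
}

// Even with the trait mixed in, a concrete class typically re-declares the
// setter so Java callers see a method with a concrete return type; that
// per-class setter is the boilerplate being questioned above.
class LogisticRegression extends HasMaxIter {
  override def setMaxIter(value: Int): this.type = set("maxIter", value)
}
```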