Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-62520491
  
    @etrain Thanks for testing this PR!
    
    > 1. LOTS of boilerplate - the UnaryTransformer is nice, but in many cases 
we've got non-unary transformers, and the setters are messy to write and 
non-intuitive. Even in UnaryTransformers - this ClassTag thing doesn't make a 
lot of sense and seems unnecessary from a user's perspective. The 
transform(schema, paramMap) stuff is all basically boilerplate with just some 
names changed, and could be handled by the type system. As is the code which 
packs the estimator into a model, which is needed basically with every new 
model family.
    
    By "user", you actually mean "developer", who creates new estimators and 
transformers. Users don't need to touch `UnaryTransformer`. As @mateiz 
mentioned, the main purpose of this PR is to propose a set of APIs that help 
users build and tune ML pipelines quickly. There is certainly some boilerplate 
code at this time. Some of it is necessary to support Java users, and hence 
worth a few extra lines of code. If you compare the code in 
`JavaSimpleTextClassificationPipeline` and `SimpleTextClassificationPipeline`, 
they look quite similar, and this is certainly better than asking Java users 
to cast `Object` into `Double`. Some boilerplate is due to a lack of tools on 
the development side. For example, we added UDT support, and now we can use 
dense/sparse vectors freely with SchemaRDD. I believe there are many 
improvements we can make on the development side, e.g., a better class 
hierarchy for learners and models, more SQL Scala DSL, etc. What we should do 
is understand more about what we need and build tools to simplify the code, 
where we definitely need feedback and requirements from users.
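
    To make the boilerplate discussion concrete, here is a rough sketch of 
what a minimal custom unary transformer might look like under the API proposed 
in this PR. The class and method names (`UnaryTransformer`, 
`createTransformFunc`, `outputDataType`) follow the current state of the PR, 
and the imports reflect the pre-1.3 package layout; treat all of this as 
illustrative rather than final:

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.catalyst.types.{ArrayType, DataType, StringType}

// A whitespace tokenizer: the only real logic is the one-line transform
// function; the surrounding structure is the boilerplate under discussion.
class MyTokenizer extends UnaryTransformer[String, Seq[String], MyTokenizer] {

  override protected def createTransformFunc(
      paramMap: ParamMap): String => Seq[String] = {
    _.toLowerCase.split("\\s+").toSeq
  }

  override protected def outputDataType: DataType = ArrayType(StringType)
}
```

    Most of the remaining boilerplate (setters, schema transforms) lives in 
the base traits, which is exactly where better tooling could shrink it 
further.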
    
    > 2. Run-time typechecking already bit me pretty hard - for example, i 
copy/pasted the line "randomSignNode.setOutputCol("fftFeatures")" when I meant 
to write 
    "fftTransform.setOutputCol("fftFeatures")" and this didn't get caught until 
runtime and the stack was not very helpful for debugging.
    
    There are certainly pros and cons to both strong and weak typing. We try 
to detect errors early at runtime with validation and transformation code at 
the schema level, before we touch the data. We definitely need tools, e.g., 
matchers, to simplify the type checks here. But I don't view this as a serious 
issue. Even with strong typing, we still need precondition checks for most 
methods, like non-negativity, dimension mismatch, etc.
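
    For example, the schema-level validation mentioned above might look 
roughly like this (the method name `transformSchema` and the `inputCol` / 
`outputCol` params are taken from the PR, but the exact signatures may 
change):

```scala
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.catalyst.types.{ArrayType, StringType, StructField, StructType}

// Runs before any data is touched: fail fast if the input column is missing
// or has the wrong type, then describe the schema of the output.
override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
  val map = this.paramMap ++ paramMap
  val inputType = schema(map(inputCol)).dataType
  require(inputType == StringType,
    s"Input column ${map(inputCol)} must be StringType but got $inputType.")
  StructType(schema.fields :+
    StructField(map(outputCol), ArrayType(StringType), nullable = false))
}
```

    A copy/paste mistake like the one you describe would then fail at the 
schema-validation step of `Pipeline.fit`, before any data is processed, with a 
message naming the offending column.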
    
    > 3. Serialization - it's fundamentally pretty tough to figure out what's 
actually getting serialized into closures. This is not a unique problem here, 
but with the rather extensive inheritance structure we're imposing, it makes it 
harder to track serialization issues down and debug them. 
    
    Agreed - I'm currently trying to figure out the unit test failure on 
Jenkins. We need to learn how to avoid this regardless of whether there is a 
single class or one that extends many. The spores project may help with this 
in the future.
    
    > 4. Tuning - while I appreciate the focus on supporting parameter tuning 
in a first release, I think it's making this PR more complicated than it needs 
to be - e.g. removing it would obviate the need for an Evaluator which might 
make things simpler (an evaluator could be just a transformer).
    
    I feel tuning is essential. We can discuss the API more or make 
incremental changes afterward. Again, this is definitely not set in stone. If 
an evaluator is a transformer, what should it output?
    
    > 5. Even slightly complicated pipelines are tough to write. For example, 
if I have an Estimator that takes an Iterator[DataSet] and produces a Model, it 
is difficult to express in this framework. 
    
    This is beyond this PR. As we discussed offline, pipeline is a simple 
sequence of operations. We could add DAG in the future, where a component may 
consume multiple input sources.
    
    > 6. Various classes/traits that user code effectively needs are marked 
private - VectorUDT, Has*Col, etc.
    
    The UDT feature needs more testing, and people have different opinions on 
the param traits (see previous comments). For now, you could develop code under 
"org.apache.spark.ml".
    
    > 7. I still think getting rid of setters and just using constructor 
arguments would simplify things here. As new versions of the same PipelineNode 
with more options get added, we'd need to add additional constructors and 
support calling in the old way for API compatibility - it's not pretty but I 
think it's better than the current proposal.
    
    We still need setters for optional params. If we put everything into 
constructor arguments, it would be really painful for Java users due to the 
lack of named arguments and default values. This is similar to the static 
methods we have in MLlib. For example, in ALS, we have 150 lines of code for 
`ALS.train`, and I bet no one remembers the ordering of the arguments:
    
    
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L650
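
    To illustrate the contrast (the setter names below are illustrative of 
the proposed style, not an exact API):

```scala
// Positional arguments: Java has no named arguments or default values, so
// the call site gives no hint of which number means what.
val model = ALS.train(ratings, 10, 20, 0.01, -1, 1L) // quick, which one is rank?

// Setter style: self-documenting, optional params can be omitted, and new
// params can be added later without breaking existing call sites.
val als = new ALS()
  .setRank(10)
  .setMaxIter(20)
  .setRegParam(0.01)
```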
    
    > 8. Why can't model just be a transformer? Because it needs to know what 
called it?
    
    Yes, it is nice to remember where it comes from.
    
    > 9. In many cases, we don't want our batch operation to be a .map on every 
row of the RDD - instead we want to write batch operators on the full 
partitions for efficiency purposes - the UnaryTransformer interface makes it 
difficult to chain operators like this together.
    
    For now, we can extract the column, apply the partition-wise 
transformation, and then zip the output column back. We may need tools on the 
SQL side to optimize this operation, but it is feasible.
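
    A rough sketch of that workaround, where `batchTransform` is a 
hypothetical `Iterator[Vector] => Iterator[Vector]` function and the `zip` 
relies on both RDDs having identical partitioning and per-partition ordering:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.catalyst.expressions.Row

// Extract the column as an RDD, transform whole partitions at once, then
// zip the results back alongside the original rows.
val features = dataset.select('features).map { case Row(v: Vector) => v }
val transformed = features.mapPartitions(batchTransform) // hypothetical batch op
val withOutput = dataset.zip(transformed).map { case (row, out) =>
  Row.fromSeq(row.toSeq :+ out)
}
```

    The SQL-side tooling mentioned above would be about making this pattern 
both cheaper and less manual.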
    
    Thanks for your useful feedback! Again, please understand that this PR 
does not try to address every aspect of ML pipelines, some of which may still 
be challenging even after a few releases.

