Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62497848
Hi guys - I spent a good deal of time today writing a new pipeline against
this PR, and I currently find it pretty difficult to use. Some issues:
1. LOTS of boilerplate - the UnaryTransformer is nice, but in many cases
we've got non-unary transformers, and the setters are messy to write and
non-intuitive. Even in UnaryTransformers, this ClassTag thing doesn't make a
lot of sense and seems unnecessary from a user's perspective. The
transform(schema, paramMap) stuff is basically all boilerplate with just some
names changed, and could be handled by the type system. The same goes for the
code that packs the estimator into a model, which is needed for basically
every new model family. (See the first sketch after this list.)
2. Run-time typechecking already bit me pretty hard - for example, I
copy/pasted the line "randomSignNode.setOutputCol("fftFeatures")" when I meant
to write "fftTransform.setOutputCol("fftFeatures")", and this didn't get
caught until runtime, where the stack trace was not very helpful for
debugging.
3. Serialization - it's fundamentally pretty tough to figure out what's
actually getting serialized into closures. That's not a problem unique to this
PR, but the rather extensive inheritance structure we're imposing makes
serialization issues harder to track down and debug.
4. Tuning - while I appreciate the focus on supporting parameter tuning in
a first release, I think it's making this PR more complicated than it needs to
be - e.g. removing it would obviate the need for an Evaluator, which might
make things simpler (an evaluator could be just a transformer).
5. Even slightly complicated pipelines are tough to write. For example, if
I have an Estimator that takes an Iterator[DataSet] and produces a Model, it is
difficult to express in this framework.
6. Various classes/traits that user code effectively needs are marked
private - VectorUDT, Has*Col, etc.
7. I still think getting rid of setters and just using constructor
arguments would simplify things here (see the second sketch after this list).
As new versions of the same PipelineNode with more options get added, we'd
need to add additional constructors and keep supporting the old calling
convention for API compatibility - it's not pretty, but I think it's better
than the current proposal.
8. Why can't a Model just be a Transformer? Because it needs to know what
called it?
9. In many cases we don't want our batch operation to be a .map over every
row of the RDD - instead, for efficiency, we want to write batch operators
over full partitions - and the UnaryTransformer interface makes it difficult
to chain operators like this together (see the last sketch after this list).
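To make (1) concrete, here's a rough, self-contained sketch of the boilerplate
pattern I mean. These are hypothetical mini-classes for illustration - not the
PR's actual API:

    import scala.reflect.ClassTag

    // Hypothetical stand-in for the framework's Transformer, for illustration.
    abstract class MiniTransformer {
      def transformSchema(schema: Seq[(String, String)]): Seq[(String, String)]
    }

    // A non-unary (two-input) transformer: none of the UnaryTransformer sugar
    // applies, so all of this gets written by hand, per transformer.
    class VectorConcat extends MiniTransformer {
      private var inputCol1 = ""
      private var inputCol2 = ""
      private var outputCol = ""

      // Three near-identical chainable setters, re-stated in every transformer.
      def setInputCol1(v: String): this.type = { inputCol1 = v; this }
      def setInputCol2(v: String): this.type = { inputCol2 = v; this }
      def setOutputCol(v: String): this.type = { outputCol = v; this }

      // The schema "transform" is pure boilerplate: append the output column.
      def transformSchema(schema: Seq[(String, String)]): Seq[(String, String)] =
        schema :+ (outputCol -> "vector")
    }

    // The ClassTag bound exists only to satisfy the framework's generic
    // machinery; the pipeline author never actually uses it.
    class PassThrough[T: ClassTag] extends MiniTransformer {
      def transformSchema(schema: Seq[(String, String)]): Seq[(String, String)] =
        schema
    }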
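For (7), here's the contrast I have in mind, again with hypothetical classes
rather than anything actually in the PR:

    // Setter style (roughly what the PR does today): nothing forces the caller
    // to set anything, so a missing or misdirected set* call (see issue 2)
    // only surfaces at runtime.
    class SetterTokenizer {
      private var inputCol = ""
      private var outputCol = ""
      def setInputCol(v: String): this.type = { inputCol = v; this }
      def setOutputCol(v: String): this.type = { outputCol = v; this }
    }

    // Constructor style: required params are checked at compile time, and an
    // option added in a later release (`lowercase`, here) keeps old call sites
    // compiling because it carries a default value.
    case class CtorTokenizer(inputCol: String,
                             outputCol: String,
                             lowercase: Boolean = true)

    object Demo {
      // A call site written before `lowercase` existed still compiles:
      val tok = CtorTokenizer(inputCol = "text", outputCol = "words")
    }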
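And for (9), the kind of partition-level operator we write all the time -
batchFft is a hypothetical stand-in for any vectorized batch kernel:

    import org.apache.spark.rdd.RDD

    object FftStage {
      // Hypothetical batched kernel: stands in for, e.g., a native FFT that is
      // far cheaper called once over a stacked batch than once per row.
      def batchFft(batch: Array[Array[Double]]): Array[Array[Double]] = batch

      // What we want: pay any per-batch setup once per partition via
      // mapPartitions, instead of the one-row-at-a-time map that
      // UnaryTransformer pushes us toward.
      def fftFeatures(in: RDD[Array[Double]]): RDD[Array[Double]] =
        in.mapPartitions { rows =>
          val batch = rows.toArray      // materialize the partition as one batch
          batchFft(batch).iterator      // one vectorized call per partition
        }
    }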
Overall - at this point, everything beyond a very simple UnaryTransformer is
very difficult for humans to write - and even then, there are a few things you
have to get *just right* for everything to compile and run.
I've shared a branch with the people on this thread that was an attempt to
rebuild @shivaram's MNIST pipeline against the current interface
(http://github.com/amplab/ml-pipelines). It compiles, but errors out on
serialization issues at runtime - and it's difficult to figure out why given
the current structure. You'll need to rebuild Spark with some fields no longer
marked private to get it to run. You can follow the instructions in the README
there, but you'll need to change the entry point to
pipelines.MnistRandomFFTPipeline.