Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62497848
Hi guys - I spent a good deal of time today writing a new pipeline against
this PR, and I currently find it pretty difficult to use. Some issues:
1. LOTS of boilerplate - the UnaryTransformer is nice, but in many cases
we've got non-unary transformers, and the setters are messy to write and
non-intuitive. Even in UnaryTransformers, this ClassTag thing doesn't make a
lot of sense and seems unnecessary from a user's perspective. The
transform(schema, paramMap) stuff is basically all boilerplate with just some
names changed, and could be handled by the type system. The same goes for the
code that packs the estimator into a model, which is needed for basically
every new model family. (See the first sketch after this list.)
2. Run-time typechecking already bit me pretty hard - for example, I
copy/pasted the line "randomSignNode.setOutputCol("fftFeatures")" when I meant
to write "fftTransform.setOutputCol("fftFeatures")", and this didn't get
caught until runtime, where the stack trace was not very helpful for
debugging.
3. Serialization - it's fundamentally pretty tough to figure out what's
actually getting serialized into closures. That's not a problem unique to this
PR, but the rather extensive inheritance structure we're imposing makes
serialization issues harder to track down and debug.
4. Tuning - while I appreciate the focus on supporting parameter tuning in
a first release, I think it's making this PR more complicated than it needs to
be - e.g. removing it would obviate the need for an Evaluator, which might
make things simpler (an evaluator could be just a transformer).
5. Even slightly complicated pipelines are tough to write. For example, if
I have an Estimator that takes an Iterator[DataSet] and produces a Model, it is
difficult to express in this framework.
6. Various classes/traits that user code effectively needs are marked
private - VectorUDT, Has*Col, etc.
7. I still think getting rid of setters and just using constructor
arguments would simplify things here (see the second sketch after this list).
As new versions of the same PipelineNode with more options get added, we'd
need to add additional constructors and keep supporting the old calling
convention for API compatibility - it's not pretty, but I think it's better
than the current proposal.
8. Why can't a Model just be a Transformer? Because it needs to know what
called it?
9. In many cases we don't want our batch operation to be a .map over every
row of the RDD - instead, for efficiency, we want to write batch operators
over full partitions - and the UnaryTransformer interface makes it difficult
to chain operators like this together (see the last sketch after this list).
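To make (1) concrete, here's a rough, self-contained sketch of the boilerplate
pattern I mean. These are hypothetical mini-classes for illustration - not the
PR's actual API:

    import scala.reflect.ClassTag

    // Hypothetical stand-in for the framework's Transformer, for illustration.
    abstract class MiniTransformer {
      def transformSchema(schema: Seq[(String, String)]): Seq[(String, String)]
    }

    // A non-unary (two-input) transformer: none of the UnaryTransformer sugar
    // applies, so all of this gets written by hand, per transformer.
    class VectorConcat extends MiniTransformer {
      private var inputCol1 = ""
      private var inputCol2 = ""
      private var outputCol = ""

      // Three near-identical chainable setters, re-stated in every transformer.
      def setInputCol1(v: String): this.type = { inputCol1 = v; this }
      def setInputCol2(v: String): this.type = { inputCol2 = v; this }
      def setOutputCol(v: String): this.type = { outputCol = v; this }

      // The schema "transform" is pure boilerplate: append the output column.
      def transformSchema(schema: Seq[(String, String)]): Seq[(String, String)] =
        schema :+ (outputCol -> "vector")
    }

    // The ClassTag bound exists only to satisfy the framework's generic
    // machinery; the pipeline author never actually uses it.
    class PassThrough[T: ClassTag] extends MiniTransformer {
      def transformSchema(schema: Seq[(String, String)]): Seq[(String, String)] =
        schema
    }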
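For (7), here's the contrast I have in mind, again with hypothetical classes
rather than anything actually in the PR:

    // Setter style (roughly what the PR does today): nothing forces the caller
    // to set anything, so a missing or misdirected set* call (see issue 2)
    // only surfaces at runtime.
    class SetterTokenizer {
      private var inputCol = ""
      private var outputCol = ""
      def setInputCol(v: String): this.type = { inputCol = v; this }
      def setOutputCol(v: String): this.type = { outputCol = v; this }
    }

    // Constructor style: required params are checked at compile time, and an
    // option added in a later release (`lowercase`, here) keeps old call sites
    // compiling because it carries a default value.
    case class CtorTokenizer(inputCol: String,
                             outputCol: String,
                             lowercase: Boolean = true)

    object Demo {
      // A call site written before `lowercase` existed still compiles:
      val tok = CtorTokenizer(inputCol = "text", outputCol = "words")
    }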
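And for (9), the kind of partition-level operator we write all the time -
batchFft is a hypothetical stand-in for any vectorized batch kernel:

    import org.apache.spark.rdd.RDD

    object FftStage {
      // Hypothetical batched kernel: stands in for, e.g., a native FFT that is
      // far cheaper called once over a stacked batch than once per row.
      def batchFft(batch: Array[Array[Double]]): Array[Array[Double]] = batch

      // What we want: pay any per-batch setup once per partition via
      // mapPartitions, instead of the one-row-at-a-time map that
      // UnaryTransformer pushes us toward.
      def fftFeatures(in: RDD[Array[Double]]): RDD[Array[Double]] =
        in.mapPartitions { rows =>
          val batch = rows.toArray      // materialize the partition as one batch
          batchFft(batch).iterator      // one vectorized call per partition
        }
    }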
Overall - at this point, everything beyond a very simple UnaryTransformer is
very difficult for humans to write - and even then, there are a few things you
have to get *just right* for everything to compile and run.
I've shared a branch with the people on this thread that was an attempt to
rebuild @shivaram's MNIST pipeline against the current interface
(http://github.com/amplab/ml-pipelines). It compiles, but errors out on
serialization issues at runtime - and it's difficult to figure out why given
the current structure. You'll need to rebuild Spark with some fields no longer
marked private to get it to run. You can follow the instructions in the README
there, but you'll need to change the entry point to
pipelines.MnistRandomFFTPipeline.