Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-62520491
  
    @etrain Thanks for testing this PR!
    
    > 1. LOTS of boilerplate - the UnaryTransformer is nice, but in many cases 
we've got non-unary transformers, and the setters are messy to write and 
non-intuitive. Even in UnaryTransformers - this ClassTag thing doesn't make a 
lot of sense and seems unnecessary from a user's perspective. The 
transform(schema, paramMap) stuff is all basically boilerplate with just some 
names changed, and could be handled by the type system. As is the code which 
packs the estimator into a model, which is needed basically with every new 
model family.
    
    By "user", you actually mean "developer", who creates new estimators and 
transformers. Users don't need to touch `UnaryTransformer`. As @mateiz 
mentioned, the main purpose of this PR is to propose a set of APIs that help 
users build and tune ML pipelines quickly. There is certainly some boilerplate 
code at this time. Some of it is necessary to support Java users, and hence 
worth a few extra lines of code. If you compare the code in 
`JavaSimpleTextClassificationPipeline` and `SimpleTextClassificationPipeline`, 
they look quite similar, and this is certainly better than asking Java users 
to cast `Object` into `Double`. Some boilerplate is due to a lack of tools on 
the development side. For example, we added UDT support, and now we can use 
dense/sparse vectors freely with SchemaRDD. I believe there are many 
improvements we can make on the development side, e.g., a better class 
hierarchy for learners and models, more SQL Scala DSL, etc. What we should do 
is understand more about what we need and build tools to simplify the code, 
where we definitely need feedback and requirements from users.
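
    To make the boilerplate discussion concrete, here is a rough sketch of 
what a minimal custom unary transformer might look like under the API proposed 
in this PR. The class and method names (`UnaryTransformer`, 
`createTransformFunc`, `outputDataType`) follow the current state of the PR, 
and the imports reflect the pre-1.3 package layout; treat all of this as 
illustrative rather than final:

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.catalyst.types.{ArrayType, DataType, StringType}

// A whitespace tokenizer: the only real logic is the one-line transform
// function; the surrounding structure is the boilerplate under discussion.
class MyTokenizer extends UnaryTransformer[String, Seq[String], MyTokenizer] {

  override protected def createTransformFunc(
      paramMap: ParamMap): String => Seq[String] = {
    _.toLowerCase.split("\\s+").toSeq
  }

  override protected def outputDataType: DataType = ArrayType(StringType)
}
```

    Most of the remaining boilerplate (setters, schema transforms) lives in 
the base traits, which is exactly where better tooling could shrink it 
further.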
    
    > 2. Run-time typechecking already bit me pretty hard - for example, i 
copy/pasted the line "randomSignNode.setOutputCol("fftFeatures")" when I meant 
to write 
    "fftTransform.setOutputCol("fftFeatures")" and this didn't get caught until 
runtime and the stack was not very helpful for debugging.
    
    There are certainly pros and cons to both strong and weak typing. We try 
to detect errors early at runtime with validation and transformation code at 
the schema level, before we touch the data. We definitely need tools, e.g., 
matchers, to simplify the type checks here. But I don't view this as a serious 
issue. Even with strong typing, we still need precondition checks for most 
methods, like non-negativity, dimension mismatch, etc.
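
    For example, the schema-level validation mentioned above might look 
roughly like this (the method name `transformSchema` and the `inputCol` / 
`outputCol` params are taken from the PR, but the exact signatures may 
change):

```scala
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.catalyst.types.{ArrayType, StringType, StructField, StructType}

// Runs before any data is touched: fail fast if the input column is missing
// or has the wrong type, then describe the schema of the output.
override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
  val map = this.paramMap ++ paramMap
  val inputType = schema(map(inputCol)).dataType
  require(inputType == StringType,
    s"Input column ${map(inputCol)} must be StringType but got $inputType.")
  StructType(schema.fields :+
    StructField(map(outputCol), ArrayType(StringType), nullable = false))
}
```

    A copy/paste mistake like the one you describe would then fail at the 
schema-validation step of `Pipeline.fit`, before any data is processed, with a 
message naming the offending column.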
    
    > 3. Serialization - it's fundamentally pretty tough to figure out what's 
actually getting serialized into closures. This is not a unique problem here, 
but with the rather extensive inheritance structure we're imposing, it makes it 
harder to track serialization issues down and debug them. 
    
    Agreed - I'm currently trying to figure out the unit test failure on 
Jenkins. We need to learn how to avoid this regardless of whether there is a 
single class or one that extends many. The spores project may help with this 
in the future.
    
    > 4. Tuning - while I appreciate the focus on supporting parameter tuning 
in a first release, I think it's making this PR more complicated than it needs 
to be - e.g. removing it would obviate the need for an Evaluator which might 
make things simpler (an evaluator could be just a transformer).
    
    I feel tuning is essential. We can discuss the API more or make 
incremental changes afterward. Again, this is definitely not set in stone. If 
an evaluator is a transformer, what should it output?
    
    > 5. Even slightly complicated pipelines are tough to write. For example, 
if I have an Estimator that takes an Iterator[DataSet] and produces a Model, it 
is difficult to express in this framework. 
    
    This is beyond this PR. As we discussed offline, pipeline is a simple 
sequence of operations. We could add DAG in the future, where a component may 
consume multiple input sources.
    
    > 6. Various classes/traits that user code effectively needs are marked 
private - VectorUDT, Has*Col, etc.
    
    The UDT feature needs more testing, and people have different opinions on 
the param traits (see previous comments). For now, you could develop code under 
"org.apache.spark.ml".
    
    > 7. I still think getting rid of setters and just using constructor 
arguments would simplify things here. As new versions of the same PipelineNode 
with more options get added, we'd need to add additional constructors and 
support calling in the old way for API compatibility - it's not pretty but I 
think it's better than the current proposal.
    
    We still need setters for optional params. If we put everything into 
constructor arguments, it would be really painful for Java users due to the 
lack of named arguments and default values. This is similar to the static 
methods we have in MLlib. For example, in ALS, we have 150 lines of code for 
`ALS.train`, and I bet no one remembers the ordering of the arguments:
    
    
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L650
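
    To illustrate the contrast (the setter names below are illustrative of 
the proposed style, not an exact API):

```scala
// Positional arguments: Java has no named arguments or default values, so
// the call site gives no hint of which number means what.
val model = ALS.train(ratings, 10, 20, 0.01, -1, 1L) // quick, which one is rank?

// Setter style: self-documenting, optional params can be omitted, and new
// params can be added later without breaking existing call sites.
val als = new ALS()
  .setRank(10)
  .setMaxIter(20)
  .setRegParam(0.01)
```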
    
    > 8. Why can't model just be a transformer? Because it needs to know what 
called it?
    
    Yes, it is nice to remember where it comes from.
    
    > 9. In many cases, we don't want our batch operation to be a .map on every 
row of the RDD - instead we want to write batch operators on the full 
partitions for efficiency purposes - the UnaryTransformer interface makes it 
difficult to chain operators like this together.
    
    For now, we can extract the column, apply the partition-wise 
transformation, and then zip the output column back. We may need tools on the 
SQL side to optimize this operation, but it is feasible.
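
    A rough sketch of that workaround, where `batchTransform` is a 
hypothetical `Iterator[Vector] => Iterator[Vector]` function and the `zip` 
relies on both RDDs having identical partitioning and per-partition ordering:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.catalyst.expressions.Row

// Extract the column as an RDD, transform whole partitions at once, then
// zip the results back alongside the original rows.
val features = dataset.select('features).map { case Row(v: Vector) => v }
val transformed = features.mapPartitions(batchTransform) // hypothetical batch op
val withOutput = dataset.zip(transformed).map { case (row, out) =>
  Row.fromSeq(row.toSeq :+ out)
}
```

    The SQL-side tooling mentioned above would be about making this pattern 
both cheaper and less manual.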
    
    Thanks for your useful feedback! Again, please understand that this PR 
does not try to address every aspect of ML pipelines, some of which may still 
be challenging even after a few releases.

