Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62105053
@mengxr @jkbradley I tried to port one of our simple image processing
pipelines to the new interface today and the code for this is at
https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/MnistRandomFFTPipeline.scala.
Note that I tried to write this as an application layer on top of Spark, as I
think our primary goal is to enable custom pipelines, not just those using our
provided components. I had some thoughts about the Pipelines and Dataset APIs
while writing this code and I've detailed them below. It'll be great if you
could also look over the code to see if I am missing something.
#### Dataset API
I should preface this section by saying that my knowledge of SparkSQL and
SchemaRDD is very limited. I haven't written much code using them, and I
referred to your code + the programming guide for the most part. But since the
SQL component itself is pretty new, I think my skill set may be representative
of the average Spark application developer. That said, please let me know if I
missed something obvious.
1. Because we are using SchemaRDDs directly, I found that there are 5-10
lines of boilerplate code in every transform or fit function. This is usually
selecting one or more input columns, getting out an RDD which can be
processed, and then adding an output column at the end. It would be good if we
could have wrappers which do this automatically (I have one proposal below).
2. The UDF interface (Row => Row) feels pretty restrictive. For numerical
computations I often want to do things like mapPartitions or use broadcast
variables, and in those cases I am not sure how to use UDFs directly.
3. The main reason I used UDFs was to handle the semantics of appending
an output column. Is there any API other than `select(Star(None), ...)` for
this? It'd be great if we had something like dataset.zip(<RDD containing new
column>), which would be easier / more flexible -- a rough sketch of what I
mean follows this list.
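To make (1) and (3) concrete, here is a rough sketch. The first half only uses the plain RDD view of the SchemaRDD; `fft` and `pixelsIndex` are placeholders, and `zipColumn` at the end is the API I wish existed (it does not exist today):
```
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SchemaRDD

// Sketch only: `fft` is a placeholder kernel and `zipColumn` is hypothetical.
def addFftColumn(dataset: SchemaRDD, pixelsIndex: Int): SchemaRDD = {
  // Pull the input column out as a plain RDD (a SchemaRDD is an RDD[Row]).
  val fftCol: RDD[Array[Double]] = dataset.map { row =>
    fft(row(pixelsIndex).asInstanceOf[Array[Double]])
  }
  // Wished-for step: attach the computed RDD as a new column without going
  // through a UDF and select(Star(None), ...).
  dataset.zipColumn("fftFeatures", fftCol)
}
```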
One proposal I had for simplifying things is to define another transformer
API which is less general but easier to use (we can keep the existing ones as
base classes). This API might look something like this:
```
abstract class SimpleTransformer[U, V] extends Transformer
    with HasInputCol with HasOutputCol {
  def transform(input: RDD[U]): RDD[V]
}
```
We would then have wrapper code that selects the input column, casts it to
`U`, and similarly takes the `V` output and adds it as the outputCol of the
dataset. There are a couple of advantages to this: first, the RDD API is
flexible, well documented, and well understood by programmers; second, it
provides a good boundary for unit testing and gives the programmer some type
safety within their transformer.
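As a sketch of what this buys us, here is roughly what an FFT node could look like under that API (FFTNode and the `fft` helper are made up for illustration; the wrapper described above would handle selecting and casting the column):
```
// Rough sketch of a concrete transformer under the proposed API.
// `fft` is a placeholder for the actual numerical kernel.
class FFTNode extends SimpleTransformer[Array[Double], Array[Double]] {
  override def transform(input: RDD[Array[Double]]): RDD[Array[Double]] = {
    // Free to use mapPartitions, broadcast variables, etc. in here.
    input.map(v => fft(v))
  }
}
```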
#### Pipelines API
1. Multiple batches of transformers -- One of the main problems I ran into
was that I couldn't see an easy way to have loops inside a pipeline. By loops I
mean repeating some part of the pipeline many times. In this example I wanted
to run RandomSignNode followed by FFT a bunch of times and then concatenate
their outputs as features for regression (the first sketch after this list
shows the workaround I ended up with). There is a similar pattern in the
pipeline described in [Evan's blog
post](https://amplab.cs.berkeley.edu/2014/09/24/4643/).
2. Passing parameters in Estimators -- I finally understood the problem you
were describing originally! However, I solved it slightly differently than by
having 2 sets of parameters (this is in
MultiClassLinearRegressionEstimator.scala). I made the parent Estimator class
take in as params all the parameters required for the child Transformer class
as well, and then passed them along. I thought this was cleaner than having
modelParams and params as two different members.
3. Parameters vs. Constructors -- While I agree with your comment that
constructor arguments make binary compatibility tricky, I ran into a couple of
cases where creating a transformer without a particular argument doesn't make
sense. Will we have a guideline that things which won't vary should be
constructor arguments? I think it'll be good to come up with some distinction
that makes it clear for the programmer.
4. Chaining evaluators to a Pipeline -- I wrote a simple evaluator that
computed test error at the end, but I wasn't sure how to chain it to a
pipeline. Is this something we don't intend to support?
5. Strings, Vectors, etc. -- There were a couple of minor things that
weren't ideal but that I think we will have to live with. With every node
having an input and output column, there are lots of strings floating around
which the user needs to get right. Also, going from Spark's Vector / Matrix
classes to Breeze and back to Spark classes gets tedious (the second sketch
after this list shows the round trip). We could add commonly used operators
like add, multiply, etc. to our public APIs, and that might help a bit.
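For (1), the best I could do was unroll the loop in user code while building the stage list, which works but pushes all the column bookkeeping onto the user. RandomSignNode, FFTNode, and FeatureConcatenator below are stand-ins for the nodes in my pipeline:
```
// Sketch: unroll the repeated RandomSignNode -> FFT branch by hand.
// RandomSignNode, FFTNode and FeatureConcatenator are stand-ins here.
val numBranches = 4
val branches: Seq[PipelineStage] = (0 until numBranches).flatMap { i =>
  Seq(
    new RandomSignNode().setInputCol("pixels").setOutputCol(s"signed_$i"),
    new FFTNode().setInputCol(s"signed_$i").setOutputCol(s"fft_$i")
  )
}
val concat = new FeatureConcatenator()
  .setInputCols((0 until numBranches).map(i => s"fft_$i").toArray)
  .setOutputCol("features")
val pipeline = new Pipeline().setStages((branches :+ concat).toArray)
```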
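And to show what I mean by the Breeze round trip in (5), this is the pattern that shows up in pretty much every node that does real math:
```
import breeze.linalg.{norm, DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// The Spark -> Breeze -> Spark round trip that every numerical node repeats.
def normalize(v: Vector): Vector = {
  val bv = new BDV(v.toArray)     // Spark Vector -> Breeze
  val result = bv / norm(bv)      // the actual math happens in Breeze
  Vectors.dense(result.toArray)   // Breeze -> Spark Vector
}
```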