Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-62105053
  
    @mengxr @jkbradley  I tried to port one of our simple image processing pipelines to the new interface today; the code is at https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/MnistRandomFFTPipeline.scala. Note that I tried to write this as an application layer on top of Spark, as I think our primary goal is to enable custom pipelines, not just those using our provided components. I had some thoughts about the Pipelines and Dataset APIs while writing this code, and I've detailed them below. It would be great if you could also look over the code to see if I am missing something.
    
    #### Dataset API
    I should preface this section by saying that my knowledge of SparkSQL and SchemaRDD is very limited. I haven't written much code using them and referred to your code plus the programming guide for the most part. But as the SQL component itself is pretty new, I think my skill set may be representative of the average Spark application developer. That said, please let me know if I missed something obvious.
    
    1. Because we are using SchemaRDDs directly, I found that there are 5-10 lines of boilerplate code in every transform or fit function. This usually consists of selecting one or more input columns, getting out an RDD that can be processed, and then adding an output column at the end. It would be good if we could have wrappers that do this automatically (I have one proposal below).
    
    2. The UDF (Row => Row) interface feels pretty restrictive. For numerical computations I often want to do things like mapPartitions or use broadcast variables, and in those cases I am not sure how to use UDFs directly (a small sketch of the kind of code I mean follows this list).
    
    3. The main reason I used UDFs was to handle the semantics of appending an output column. Is there any API for this other than `select Star(None) ...`? It would be great if we had something like `dataset.zip(<RDD containing new column>)`, which would be easier and more flexible.
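
    To make point 2 concrete, here is a rough sketch of the kind of RDD-level code I'd like a transformer to be able to contain. This is plain spark-core, and `applyModel` and the weight vector are just placeholder names for this example:
    ```
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Rough sketch: apply a broadcast weight vector with mapPartitions so the
    // broadcast value is fetched once per partition rather than once per row.
    // applyModel / weights are placeholder names, not part of the PR.
    def applyModel(sc: SparkContext, data: RDD[Array[Double]], weights: Array[Double]): RDD[Double] = {
      val bcWeights = sc.broadcast(weights)
      data.mapPartitions { iter =>
        val w = bcWeights.value
        iter.map(x => x.zip(w).map { case (xi, wi) => xi * wi }.sum)
      }
    }
    ```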
    
    One proposal I had for simplifying things is to define another transformer API which is less general but easier to use (we could keep the existing ones as base classes). This API might look something like this:
    ``` 
      abstract class SimpleTransformer[U, V] extends Transformer with HasInputCol with HasOutputCol {
        def transform(input: RDD[U]): RDD[V]
      }
    ```
    We would then have wrapper code that selects the input column, casts it to `U`, and similarly takes the output `V` and adds it as the outputCol of the dataset. There are a couple of advantages to this: first, the RDD API is flexible, well documented, and well understood by programmers; second, it provides a good boundary for unit testing and gives the programmer some type safety within their transformer.
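
    A rough, self-contained sketch of what this could look like (this version does not yet extend the Transformer / HasInputCol / HasOutputCol traits from the PR, and the Scaler is just a toy example):
    ```
    import org.apache.spark.rdd.RDD

    // Sketch of the proposed shape; the shared wrapper (not shown) would select
    // inputCol from the dataset, cast it to U, and append the result as outputCol.
    abstract class SimpleTransformer[U, V] extends Serializable {
      def inputCol: String
      def outputCol: String
      def transform(input: RDD[U]): RDD[V]
    }

    // Toy example: element-wise scaling written purely against the RDD API.
    class Scaler(val inputCol: String, val outputCol: String, scale: Double)
      extends SimpleTransformer[Double, Double] {
      override def transform(input: RDD[Double]): RDD[Double] = input.map(_ * scale)
    }
    ```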
    
    #### Pipelines API
    
    1. Multiple batches of transformers -- One of the main problems I ran into was that I couldn't see an easy way to have loops inside a pipeline. By loops I mean repeating some part of the pipeline many times. In this example I wanted to run RandomSignNode followed by FFT a number of times and then concatenate their outputs as features for regression. There is a similar pattern in the pipeline described in [Evan's blog post](https://amplab.cs.berkeley.edu/2014/09/24/4643/).
    
    2. Passing parameters in Estimators -- I finally understood the problem you were describing originally! However, I solved it slightly differently than by having two sets of parameters (this is in MultiClassLinearRegressionEstimator.scala): I made the parent Estimator class also take in as params all the parameters required by the child Transformer class, and then passed them along (a stripped-down sketch of this pattern is at the end of this list). I thought this was cleaner than having modelParams and params as two different members.
    
    3. Parameters vs. Constructors -- While I agree with your comment that constructor arguments make binary compatibility tricky, I ran into a couple of cases where creating a transformer without a particular argument doesn't make sense. Will we have a guideline that things which won't vary should be constructor arguments? I think it would be good to come up with some distinction that makes this clear to the programmer.
    
    4. Chaining evaluators to a Pipeline -- I wrote a simple evaluator that computes test error at the end, but I wasn't sure how to chain it to a pipeline. Is this something we don't intend to support?
    
    5. Strings, Vectors, etc. -- There were a couple of minor things that weren't ideal but that I think we will have to live with. With every node having an input and output column, there are lots of strings floating around which the user needs to get right. Also, going from Spark's Vector / Matrix classes to Breeze and back to the Spark classes gets tedious. We could add commonly used operators like add, multiply, etc. to our public APIs, and that might help a bit.
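
    To make point 2 above concrete, here is a stripped-down version of the pattern, without the Params machinery (the class names and the "training" here are placeholders, not what is in MultiClassLinearRegressionEstimator.scala):
    ```
    // The estimator takes every parameter the produced model needs (here
    // `threshold`) and forwards it, instead of keeping a separate modelParams
    // member alongside params.
    class SimpleLinearModel(val weights: Array[Double], val threshold: Double) {
      def predict(x: Array[Double]): Double = {
        val margin = x.zip(weights).map { case (xi, wi) => xi * wi }.sum
        if (margin > threshold) 1.0 else 0.0
      }
    }

    class SimpleLinearEstimator(val numIterations: Int, val threshold: Double) {
      // numIterations would be used by a real training loop; threshold is also
      // needed by the produced model, so the estimator accepts it and passes it along.
      def fit(data: Seq[(Double, Array[Double])]): SimpleLinearModel = {
        // Placeholder "training" (mean feature vector) to keep the sketch self-contained.
        val dim = data.head._2.length
        val sums = data.map(_._2).foldLeft(Array.fill(dim)(0.0)) { (acc, x) =>
          acc.zip(x).map { case (a, b) => a + b }
        }
        new SimpleLinearModel(sums.map(_ / data.size), threshold) // threshold forwarded
      }
    }
    ```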

