[ https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527351#comment-16527351 ]

Lucas Partridge commented on SPARK-19498:
-----------------------------------------

Ok great. Here's my feedback after wrapping a large, complex Python algorithm 
for ML Pipelines on Spark 2.2.0. Several of these comments probably apply beyond 
pyspark too.
 # The inability to save and load custom pyspark 
models/pipelines/pipelinemodels is an absolute showstopper. Training a model can 
take hours, so we need to be able to save and reload models. Pending the 
availability of https://issues.apache.org/jira/browse/SPARK-17025 I used a 
refinement of [https://stackoverflow.com/a/49195515/1843329] to work around 
this (see the sketch for this item after the list). Had this not been solved, 
no further work would have been done.
 # Support for saving/loading more param types would be great. I had to use 
json.dumps to convert our algorithm's internal model into a string and then 
pretend it was a string param in order to save and load it with the rest of 
the transformer (sketch below).
 # Given that pipelinemodels can now be saved, we also need the ability to export 
them easily for deployment on other clusters. The cluster where you train the 
model may be different from the one where you deploy it for predictions. A hacky 
workaround is to use hdfs commands to copy the relevant files and directories 
(sketch below), but it would be great to have simple single export/import 
commands in pyspark to move models/pipelines/pipelinemodels easily between 
clusters and to allow artifacts to be stored off-cluster.
 # Creating individual parameters with getters and setters is tedious and 
error-prone, especially if writing docs inline too (sketch below). It would be 
great if as much of this boiler-plate as possible could be auto-generated from 
a simple parameter definition. I always groan when someone asks for an extra 
param at the moment!
 # The ML Pipeline API seems to assume all the params live on the estimator and 
none on the transformer. In the algorithm I wrapped, the model/transformer has 
numerous params that are specific to it rather than to the estimator. 
PipelineModel needs a getStages() method (just as Pipeline has) to get at the 
model so you can parameterise it; I had to use the undocumented .stages member 
instead. And if you want to call transform() on a pipelinemodel immediately 
after fitting it, you also need some way to set the model/transformer params in 
advance. I got around this by defining a params class for the estimator-only 
params and another class for the model-only params (sketch below). The 
estimator inherits from both classes and the model inherits from only the 
model-only params class; the estimator then passes any model-specific params 
through to the model when it creates it at the end of its fit() method. But to 
distinguish the model-only params from the estimator's (e.g., when listing the 
params on the estimator) I had to prefix all the model-only params with a 
common value to identify them. This works, but it's clunky and ugly.
 # The algorithm I ported works naturally with individually named column 
inputs, but the existing ML Pipeline library prefers DenseVectors. I ended up 
having to support both types of input: if the DenseVector input was None I 
would take the data directly from the individually named columns instead 
(sketch below). Users who want to use the algorithm by itself can use the 
column-based approach; those who want to combine it with algorithms from the 
built-in library (e.g., StandardScaler, Binarizer, etc.) can use the 
DenseVector approach instead. Again this works, but it's clunky because you're 
handling two different forms of input inside the same implementation. 
DenseVectors are also limited by their inability to handle missing values.
 # Similarly, I wanted the model's transform() method to produce multiple 
separate output columns, whereas most built-in algorithms seem to use a single 
DenseVector output column. DataFrame's withColumn() method could do with a 
withColumns() equivalent to make it easy to add multiple columns to a 
DataFrame instead of just one at a time (sketch below).
 # Documentation explaining how to create a custom estimator and transformer 
(preferably one with transformer-specific params) would be extremely useful. 
Most of what I learned I gleaned from Stack Overflow and from reading Spark's 
own pipeline code.
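
To make the points above concrete, here are some minimal sketches. All class, param and path names in them are invented for illustration; they're not from our actual code.

Sketch for item 1 - a persistable custom transformer. This assumes the DefaultParamsReadable/DefaultParamsWritable mixins that SPARK-17025 delivers in pyspark 2.3+; on 2.2.0 the Stack Overflow workaround linked above plays the same role.

{code:python}
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

class MyIdentityTransformer(Transformer, HasInputCol, HasOutputCol,
                            DefaultParamsReadable, DefaultParamsWritable):
    """Toy transformer whose params survive save()/load() via the
    DefaultParams* mixins (hypothetical no-op logic)."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(MyIdentityTransformer, self).__init__()
        self._set(**self._input_kwargs)

    def _transform(self, dataset):
        # Placeholder logic: copy the input column to the output column.
        return dataset.withColumn(self.getOutputCol(),
                                  dataset[self.getInputCol()])

# t = MyIdentityTransformer(inputCol="x", outputCol="y")
# t.save("/tmp/my-transformer")
# t2 = MyIdentityTransformer.load("/tmp/my-transformer")
{code}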
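
Sketch for item 2 - the json.dumps trick: smuggle an arbitrary JSON-serialisable object through a string Param so it gets saved and loaded with everything else.

{code:python}
import json
from pyspark.ml.param import Param, Params, TypeConverters

class HasInternalModel(Params):
    """Hypothetical mixin: store the algorithm's internal model as a
    JSON string so param-based persistence can handle it."""

    internalModelJson = Param(Params._dummy(), "internalModelJson",
                              "internal model serialised as a JSON string",
                              typeConverter=TypeConverters.toString)

    def setInternalModel(self, obj):
        return self._set(internalModelJson=json.dumps(obj))

    def getInternalModel(self):
        return json.loads(self.getOrDefault(self.internalModelJson))
{code}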
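
Sketch for item 3 - the hdfs-copy workaround for moving a saved model between clusters (paths are made up):

{code:python}
import subprocess

def export_pipeline_model(model, hdfs_path, local_dir):
    """Save a fitted PipelineModel to HDFS, then pull the saved
    directory onto the local filesystem for transfer off-cluster."""
    model.write().overwrite().save(hdfs_path)
    subprocess.check_call(["hdfs", "dfs", "-get", hdfs_path, local_dir])

# On the destination cluster, reverse the copy and reload:
#   hdfs dfs -put <local_dir> <hdfs_path>
# then in pyspark:
#   from pyspark.ml import PipelineModel
#   model = PipelineModel.load(hdfs_path)
{code}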
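
Sketch for item 4 - the boiler-plate that every single extra param currently costs (hypothetical param):

{code:python}
from pyspark.ml.param import Param, Params, TypeConverters

class HasDecayRate(Params):
    """All of this is needed for ONE param - declaration, doc string,
    setter and getter - and it's repeated for every param added."""

    decayRate = Param(Params._dummy(), "decayRate",
                      "exponential decay rate in (0, 1]",
                      typeConverter=TypeConverters.toFloat)

    def setDecayRate(self, value):
        return self._set(decayRate=value)

    def getDecayRate(self):
        return self.getOrDefault(self.decayRate)
{code}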
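
Sketch for item 5 - the two params-mixin arrangement described above, with the estimator passing model-only params through to the model it fits:

{code:python}
from pyspark.ml import Estimator, Model
from pyspark.ml.param import Param, Params, TypeConverters

class _EstimatorOnlyParams(Params):
    maxIter = Param(Params._dummy(), "maxIter", "max training iterations",
                    typeConverter=TypeConverters.toInt)

class _ModelOnlyParams(Params):
    # The 'model' prefix marks params that belong to the fitted model,
    # so they stand out when listed on the estimator.
    modelVerbose = Param(Params._dummy(), "modelVerbose",
                         "verbosity of the model's transform()",
                         typeConverter=TypeConverters.toBoolean)

class MyModel(Model, _ModelOnlyParams):
    def _transform(self, dataset):
        return dataset  # placeholder: real prediction logic goes here

class MyEstimator(Estimator, _EstimatorOnlyParams, _ModelOnlyParams):
    def __init__(self):
        super(MyEstimator, self).__init__()
        self._setDefault(maxIter=10, modelVerbose=False)

    def _fit(self, dataset):
        # Training would happen here; _copyValues() passes the
        # model-only params through to the fitted model.
        return self._copyValues(MyModel())
{code}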
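
Sketch for item 6 - the dual-input fallback, heavily abridged; in our code the equivalent logic lives inside the transformer's _transform():

{code:python}
from pyspark.sql import functions as F

def gather_features(features_col, named_cols):
    """Hypothetical helper: prefer the DenseVector column when one is
    configured, else build the input from individually named columns
    (which, unlike a DenseVector, can carry nulls for missing values)."""
    if features_col is not None:
        return F.col(features_col)
    return F.array(*[F.col(c) for c in named_cols])

# inside _transform():
#   feats = gather_features(self.getOrDefault(self.featuresCol),
#                           ["age", "height", "weight"])
{code}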
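
Sketch for item 7 - a withColumns() stand-in that adds several columns in one select() rather than chained withColumn() calls:

{code:python}
from pyspark.sql import functions as F

def with_columns(df, col_map):
    """col_map maps new column names to Column expressions."""
    return df.select("*",
                     *[expr.alias(name) for name, expr in col_map.items()])

# df2 = with_columns(df, {"doubled": F.col("x") * 2,
#                         "squared": F.col("x") ** 2})
{code}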

Hope this list will be useful for improving ML Pipelines in future versions of 
Spark!

> Discussion: Making MLlib APIs extensible for 3rd party libraries
> ----------------------------------------------------------------
>
>                 Key: SPARK-19498
>                 URL: https://issues.apache.org/jira/browse/SPARK-19498
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Joseph K. Bradley
>            Priority: Critical
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we 
> can make MLlib DataFrame-based APIs more extensible, especially for the 
> purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs 
> (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough?  Do they require changes 
> before being made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or 
> extensive enough?
> The easy answer is to make everything public, but that would be terrible of 
> course in the long-term.  Let's discuss what is needed and how we can present 
> stable, sufficient, and easy-to-use APIs for 3rd-party developers.


