[jira] [Commented] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

Peter Knight (JIRA) Wed, 04 Jul 2018 03:09:57 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532576#comment-16532576
 ]


Peter Knight commented on SPARK-19498:
--------------------------------------

Just wanted to add to Lucas' post that I agree with them. I love the ML 
Pipeline concept, but creating a custom transformer is currently verbose. 
Here is an example of number 3 showing the amount of repeated boilerplate code 
needed. You can see that the default values are set 3 times for each 
parameter and each parameter name is typed 10 times!
{code}
from pyspark import keyword_only
from pyspark.ml.param.shared import Param, Params, TypeConverters
from pyspark.ml import Transformer

class StraightLine(Transformer):
        @keyword_only 
        def __init__(self, inputCol=None, outputCol=None, m=1.0, c=0.0): 
                super(StraightLine, self).__init__() 
                self._setDefault(inputCol=None, outputCol=None, m=1.0, c=0.0)
                kwargs = self._input_kwargs 
                self.setParams(**kwargs) 
        @keyword_only 
        def setParams(self, inputCol=None, outputCol=None, m=1.0, c=0.0): 
                kwargs = self._input_kwargs 
                return self._set(**kwargs) 
                
        # inputCol Param        
        inputCol = Param(Params._dummy(), "inputCol",
                         "specify the input column name (your X). (string)",
                         typeConverter=TypeConverters.toString)
        def setInputCol(self, value):
                return self._set(inputCol=value)
        def getInputCol(self):
                return self.getOrDefault(self.inputCol)
                
        # outputCol Param       
        outputCol = Param(Params._dummy(), "outputCol",
                          "specify the output column name (your Y). (string)",
                          typeConverter=TypeConverters.toString)
        def setOutputCol(self, value):
                return self._set(outputCol=value)
        def getOutputCol(self):
                return self.getOrDefault(self.outputCol)

        # m Param       
        m = Param(Params._dummy(), "m",
                  "specify m - the slope of the line. (float)",
                  typeConverter=TypeConverters.toFloat)
        def setM(self, value):
                return self._set(m=value)
        def getM(self):
                return self.getOrDefault(self.m)                 

        # c Param       
        c = Param(Params._dummy(), "c",
                  "specify c - the y offset when x = 0. (float)",
                  typeConverter=TypeConverters.toFloat)
        def setC(self, value):
                return self._set(c=value)
        def getC(self):
                return self.getOrDefault(self.c)

        # Define the Transformer
        def _transform(self, dataset):
                
                # get the input and output column names
                input_col = self.getInputCol()
                if not input_col:
                        raise Exception("inputCol not supplied")

                output_col = self.getOutputCol()
                if not output_col:
                        raise Exception("outputCol not supplied")
                        
                return dataset.selectExpr(
                        "*",
                        str(self.getM()) + " * " + input_col + " + "
                        + str(self.getC()) + " AS " + output_col)
{code}
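To make the last line of _transform easier to follow, here is a small sketch of 
the expression string it hands to selectExpr. The helper name 
straight_line_expr is made up for this illustration and is not part of pyspark; 
it only repeats the string concatenation from the class above.

```python
def straight_line_expr(input_col, output_col, m=1.0, c=0.0):
    """Build the SQL expression string used in _transform above.

    Hypothetical helper for illustration only; not a pyspark function.
    """
    return str(m) + " * " + input_col + " + " + str(c) + " AS " + output_col


# With m=2.0 and c=0.5, the new column y is computed as 2.0 * x + 0.5.
print(straight_line_expr("x", "y", m=2.0, c=0.5))  # 2.0 * x + 0.5 AS y
```

Note that the column names are concatenated unquoted, so a column name 
containing spaces would need backticks around it in the expression.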
My preference would be to have a function 
addParam(name, description, type, default_value, is_required), which would make 
the code look more like the code below (going from 50+ lines to <10). 
is_required would default to true, and an error would be thrown if a required 
parameter were None. If explainParams also showed the expected data type, I 
wouldn't have to add it to the description myself each time. If getters and 
setters are not easy to generate, a generic getParam(name) would do. 
{code}
class StraightLine(Transformer):
        addParam("inputCol", "specify the input column name (your X).",
                 String, None)
        addParam("outputCol", "specify the output column name (your Y).",
                 String, None)
        addParam("m", "specify m - the slope of the line.", Float, 1.0)
        addParam("c", "specify c - the y offset when x = 0.", Float, 0.0)

        def _transform(self, dataset):
                return dataset.selectExpr(
                        "*",
                        str(self.getM()) + " * " + self.getInputCol() + " + "
                        + str(self.getC()) + " AS " + self.getOutputCol())
{code}
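To show that such a helper is feasible, here is a rough pure-Python sketch of 
the idea, written without pyspark so it runs on its own. Everything in it 
(add_param, with_params, ParamHolder, the expression method) is a hypothetical 
stand-in I made up for illustration, not an existing or planned MLlib API. In a 
real implementation the Param and TypeConverters machinery would presumably 
take the place of the plain type call used here.

```python
from collections import namedtuple

# Hypothetical description of one parameter (not a pyspark class).
ParamSpec = namedtuple("ParamSpec", "name doc type default is_required")


def add_param(name, doc, type_, default=None, is_required=True):
    """Declare a parameter once: name, description, type and default."""
    return ParamSpec(name, doc, type_, default, is_required)


def _make_accessors(name):
    """Create a getter/setter pair bound to one parameter name."""
    def getter(self):
        return self.getParam(name)

    def setter(self, value):
        return self._set(name, value)

    return getter, setter


def with_params(*specs):
    """Class decorator that generates get<Name>/set<Name> for each spec."""
    def decorate(cls):
        cls._specs = {spec.name: spec for spec in specs}
        for spec in specs:
            suffix = spec.name[0].upper() + spec.name[1:]
            getter, setter = _make_accessors(spec.name)
            setattr(cls, "get" + suffix, getter)
            setattr(cls, "set" + suffix, setter)
        return cls
    return decorate


class ParamHolder:
    """Minimal stand-in for the Params base class."""
    _specs = {}

    def __init__(self, **kwargs):
        self._values = {}
        for name, value in kwargs.items():
            if name not in self._specs:
                raise TypeError("unknown param: " + name)
            self._set(name, value)

    def _set(self, name, value):
        # Convert to the declared type, leaving None untouched.
        if value is not None:
            value = self._specs[name].type(value)
        self._values[name] = value
        return self

    def getParam(self, name):
        spec = self._specs[name]
        value = self._values.get(name, spec.default)
        if value is None and spec.is_required:
            raise ValueError("required param %r is not set" % name)
        return value

    def explainParams(self):
        # Include the expected type so it need not be repeated in the doc.
        return "\n".join(
            "%s (%s, default=%r): %s" % (s.name, s.type.__name__, s.default, s.doc)
            for s in self._specs.values())


@with_params(
    add_param("inputCol", "specify the input column name (your X).", str),
    add_param("outputCol", "specify the output column name (your Y).", str),
    add_param("m", "specify m - the slope of the line.", float, 1.0),
    add_param("c", "specify c - the y offset when x = 0.", float, 0.0),
)
class StraightLine(ParamHolder):
    def expression(self):
        # Same expression string that _transform would pass to selectExpr.
        return "{} * {} + {} AS {}".format(
            self.getM(), self.getInputCol(), self.getC(), self.getOutputCol())


line = StraightLine(inputCol="x", outputCol="y", m=2)
print(line.expression())  # 2.0 * x + 0.0 AS y
```

Each parameter name and default is written once, and a missing required 
parameter raises an error on first use rather than needing a manual check in 
_transform.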

> Discussion: Making MLlib APIs extensible for 3rd party libraries
> ----------------------------------------------------------------
>
>                 Key: SPARK-19498
>                 URL: https://issues.apache.org/jira/browse/SPARK-19498
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Joseph K. Bradley
>            Priority: Critical
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we 
> can make MLlib DataFrame-based APIs more extensible, especially for the 
> purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs 
> (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough?  Do they require changes 
> before being made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or 
> extensive enough?
> The easy answer is to make everything public, but that would be terrible of 
> course in the long-term.  Let's discuss what is needed and how we can present 
> stable, sufficient, and easy-to-use APIs for 3rd-party developers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

