Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19630#discussion_r151333891
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2208,26 +2089,39 @@ def udf(f=None, returnType=StringType()):
         |         8|      JOHN DOE|          22|
         +----------+--------------+------------+
         """
    -    return _create_udf(f, returnType=returnType, pythonUdfType=PythonUdfType.NORMAL_UDF)
    +    # decorator @udf, @udf(), @udf(dataType())
    +    if f is None or isinstance(f, (str, DataType)):
    +        # If DataType has been passed as a positional argument
    +        # for decorator use it as a returnType
    +        return_type = f or returnType
    +        return functools.partial(_create_udf, returnType=return_type,
    +                                 evalType=PythonEvalType.SQL_BATCHED_UDF)
    +    else:
    +        return _create_udf(f=f, returnType=returnType,
    +                           evalType=PythonEvalType.SQL_BATCHED_UDF)
     
     
     @since(2.3)
    -def pandas_udf(f=None, returnType=StringType()):
    +def pandas_udf(f=None, returnType=None, functionType=None):
         """
         Creates a vectorized user defined function (UDF).
     
         :param f: user-defined function. A python function if used as a standalone function
         :param returnType: a :class:`pyspark.sql.types.DataType` object
    +    :param functionType: an enum value in :class:`pyspark.sql.functions.PandasUdfType`.
    +                         Default: SCALAR.
     
    -    The user-defined function can define one of the following transformations:
    +    The function type of the UDF can be one of the following:
     
    -    1. One or more `pandas.Series` -> A `pandas.Series`
    +    1. SCALAR
     
    -       This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and
    -       :meth:`pyspark.sql.DataFrame.select`.
    +       A scalar UDF defines a transformation: One or more `pandas.Series` -> A `pandas.Series`.
           The returnType should be a primitive data type, e.g., `DoubleType()`.
           The length of the returned `pandas.Series` must be the same as that of the input `pandas.Series`.
     
    +       Scalar UDFs are used with :meth:`pyspark.sql.DataFrame.withColumn` and
    +       :meth:`pyspark.sql.DataFrame.select`.
    +
            >>> from pyspark.sql.types import IntegerType, StringType
            >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
            >>> @pandas_udf(returnType=StringType())
    --- End diff --
    
    In this doctest, there are two `pandas_udf` calls. Please explicitly assign
    `PandasUDFType.SCALAR` as the `functionType` of one of the UDFs.
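    For reference, the argument dispatch in the new `udf` body (supporting `@udf`, `@udf()`, and `@udf(dataType)`) can be sketched in plain Python without a PySpark dependency. `DataType`, `StringType`, and `_create_udf` below are simplified stand-ins for illustration, not the real PySpark implementations:

    ```python
    import functools

    # Simplified stand-ins for the PySpark names in the diff (illustration only).
    class DataType:
        pass

    class StringType(DataType):
        pass

    def _create_udf(f, returnType):
        # The real _create_udf builds a Spark UserDefinedFunction; here we
        # just tag the wrapped function with its declared return type.
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            return f(*args, **kwargs)
        wrapped.returnType = returnType
        return wrapped

    def udf(f=None, returnType=StringType()):
        # Used as @udf, @udf(), or @udf(dataType): f is the decorated function
        # only in the bare @udf form; otherwise it is None, a type string, or
        # a DataType instance.
        if f is None or isinstance(f, (str, DataType)):
            # A DataType (or type string) passed positionally is the returnType.
            return_type = f or returnType
            return functools.partial(_create_udf, returnType=return_type)
        else:
            return _create_udf(f=f, returnType=returnType)

    @udf                    # bare decorator: f is the function itself
    def shout(s):
        return s.upper()

    @udf(StringType())      # DataType passed positionally as the returnType
    def greet(name):
        return "hello, " + name
    ```

    The `isinstance(f, (str, DataType))` check is what lets a single parameter serve both as the decorated function and as a positionally-passed return type.
    
    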

