Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18732#discussion_r142693686
  
    --- Diff: python/pyspark/worker.py ---
    @@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
     
     
     def wrap_pandas_udf(f, return_type):
    -    arrow_return_type = toArrowType(return_type)
    -
    -    def verify_result_length(*a):
    -        result = f(*a)
    -        if not hasattr(result, "__len__"):
    -            raise TypeError("Return type of pandas_udf should be a Pandas.Series")
    -        if len(result) != len(a[0]):
    -            raise RuntimeError("Result vector from pandas_udf was not the required length: "
    -                               "expected %d, got %d" % (len(a[0]), len(result)))
    -        return result
    -    return lambda *a: (verify_result_length(*a), arrow_return_type)
    +    if isinstance(return_type, StructType):
    +        arrow_return_types = [to_arrow_type(field.dataType) for field in return_type]
    +
    +        def fn(*a):
    --- End diff --
    
    `verify_result_type` is kind of a misnomer, because this function does two things:
    
    1. Convert the output of the user-defined function (a pandas.DataFrame) to the form the serializer takes (a list of (pd.Series, DataType) pairs).
    
    2. Validate the return value of the user-defined function.
    
    Part of the result-type verification is done in the serializer during coercion.
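    
    To make point 1 concrete, here is a minimal sketch (not the actual patch) of what such a wrapper could look like for a StructType return type. The name `wrap_pandas_udf_struct` and the error messages are illustrative, and the sketch assumes the user's pandas.DataFrame columns come back in field order:
    
        import pandas as pd
        # Assumption: to_arrow_type is the helper referenced in the diff above,
        # imported here from pyspark.sql.types as worker.py does.
        from pyspark.sql.types import to_arrow_type
        
        def wrap_pandas_udf_struct(f, return_type):  # hypothetical name
            # One Arrow type per field of the StructType, as in the diff.
            arrow_return_types = [to_arrow_type(field.dataType) for field in return_type]
        
            def fn(*a):
                result = f(*a)
                # 2. Validate the return value of the user-defined function.
                if not isinstance(result, pd.DataFrame):
                    raise TypeError("Return type of pandas_udf should be a pandas.DataFrame")
                if len(result) != len(a[0]):
                    raise RuntimeError("Result from pandas_udf was not the required length: "
                                       "expected %d, got %d" % (len(a[0]), len(result)))
                # 1. Convert the DataFrame into the (pd.Series, DataType) pairs the
                #    serializer takes; remaining type checks happen there during coercion.
                return [(result[col], t) for col, t in zip(result.columns, arrow_return_types)]
            return fn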

