[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

BryanCutler Tue, 29 May 2018 10:37:59 -0700

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191511343
  
    --- Diff: python/pyspark/worker.py ---
    @@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +        try:
    +            # Assign result columns by schema name
    +            return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    +        except KeyError:
    +            if all(not isinstance(name, basestring) for name in result.columns):
    +                # Assign result columns by position if they are not named with strings
    +                return [(result[result.columns[i]], to_arrow_type(field.dataType))
    +                        for i, field in enumerate(return_type)]
    +            else:
    +                raise
    --- End diff --
    
    @viirya I think the issue is simply that it is very common for users to
    create a DataFrame from a dict keyed by column name without knowing that
    this can change the order of the columns.  So even if the field types all
    match (in the case of this JIRA they were all StringTypes), the data can end
    up paired with the wrong column names.  From the user's perspective this is
    confusing, and it is hard to figure out what is going on.
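
    To illustrate the mix-up described above, here is a minimal sketch in plain
    Python, without pandas or Spark.  The field names, the sample values, and
    the helpers pair_by_position and pair_by_name are all hypothetical; a list
    of (label, values) pairs stands in for the returned DataFrame.

```python
# Hypothetical schema: two string fields, in the order the user declared them.
schema_fields = ["name", "city"]

# Hypothetical returned "frame": the columns came out in a different order
# than the schema (here, alphabetical), and both columns hold strings.
frame_columns = [("city", ["paris", "oslo"]), ("name", ["alice", "bob"])]


def pair_by_position(columns, fields):
    """Pair the i-th column with the i-th schema field, ignoring labels."""
    return {field: values for field, (_, values) in zip(fields, columns)}


def pair_by_name(columns, fields):
    """Pair each schema field with the column carrying the same label."""
    by_label = dict(columns)
    return {field: by_label[field] for field in fields}


by_position = pair_by_position(frame_columns, schema_fields)
by_name = pair_by_name(frame_columns, schema_fields)

# Positional pairing silently puts city values under "name"; no type check
# can catch this because every field is a string.
print(by_position)
print(by_name)
```

    Pairing by name keeps each value under the field the user intended, no
    matter what order the columns arrive in.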
    
    When defining the pandas_udf, the return type requires the field names, so
    if the returned DataFrame has columns labeled with strings, I think it's
    fair to assume that a mismatch with those names is a mistake.  If the user
    wants positional columns, they can label the columns with integers, and
    I'll add this to the docs.
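
    The lookup rule in the diff can be sketched on its own, roughly as follows.
    This is a simplified stand-in, not the code in worker.py: resolve_columns
    is a hypothetical helper that works on plain lists of labels, it uses str
    where the diff uses basestring, and it assumes the column count has already
    been checked against the schema.

```python
def resolve_columns(column_labels, field_names):
    """Return, for each schema field, the column label to read from.

    Columns are matched by name first. Position is used only when none of
    the column labels are strings; otherwise a mismatch is an error.
    """
    labels = list(column_labels)
    missing = [name for name in field_names if name not in labels]
    if not missing:
        # Every schema field has a column of the same name.
        return list(field_names)
    if all(not isinstance(label, str) for label in labels):
        # No string labels at all (e.g. 0, 1, 2): fall back to position.
        return labels[:len(field_names)]
    # String labels that do not match the schema are treated as a mistake.
    raise KeyError("columns not found by name: {}".format(missing))


# Same names in a different order: resolved by name.
print(resolve_columns(["city", "name"], ["name", "city"]))
# Integer labels: resolved by position.
print(resolve_columns([0, 1], ["name", "city"]))
```

    With string labels that do not match the schema, such as ["a", "b"], this
    sketch raises KeyError instead of falling back to position.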
    
    That being said, I suppose this does slightly change the behavior if the
    user had gone out of their way to write a pandas_udf that returns columns
    with different names than the return type schema, but with the same field
    type order.  That seems pretty unlikely to me though.


