[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

icexelloss Fri, 22 Jun 2018 11:06:01 -0700

Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r197525704
  
    --- Diff: python/pyspark/worker.py ---
    @@ -110,9 +116,20 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +
    +        if not assign_cols_by_pos:
    +            try:
    +                # Assign result columns by schema name
    +                return [(result[field.name], to_arrow_type(field.dataType))
    +                        for field in return_type]
    +            except KeyError:
    --- End diff --
    
    I think `result.iloc[:,i]` and `result[result.columns[i]]` are the same,
    so you don't have to change it if you prefer `result.columns[i]`.
    
    I agree `to_arrow_type` doesn't throw `KeyError`, but in general I feel
    it's more robust not to assume the implementation details of
    `to_arrow_type`. I think the code is also more concise and readable with
    if/else (compared to `except KeyError`).
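
    A rough sketch of the if/else shape I have in mind. Everything here is
    a stand-in for illustration: `Field`, `to_arrow_type_stub` and
    `assign_columns` are hypothetical names, a plain dict plays the role of
    the returned pandas.DataFrame, and the "all schema names are present"
    condition is only my assumption about when to assign by name. It also
    assumes the column-count check above has already passed.

```python
from collections import namedtuple

# Hypothetical stand-in for a schema field; not the pyspark StructField class.
Field = namedtuple("Field", ["name", "dataType"])

# Hypothetical type table, used only for this illustration.
_ARROW_NAMES = {"long": "int64", "double": "float64"}


def to_arrow_type_stub(data_type):
    """Stand-in for to_arrow_type; raises KeyError for unknown types."""
    return _ARROW_NAMES[data_type]


def assign_columns(result, return_type, convert=to_arrow_type_stub):
    """Pair each result column with its converted type.

    `result` is a dict of column name -> values, standing in for a
    pandas.DataFrame (assumption made to keep the sketch self-contained).
    """
    if all(field.name in result for field in return_type):
        # Every schema name is present: assign result columns by name.
        return [(result[field.name], convert(field.dataType))
                for field in return_type]
    else:
        # Otherwise assign result columns by position.
        columns = list(result)
        return [(result[columns[i]], convert(field.dataType))
                for i, field in enumerate(return_type)]
```

    Since there is no `except KeyError` around the conversion call, a
    `KeyError` raised inside the converter (if its implementation ever
    changes) propagates to the caller instead of being treated as a missing
    column name.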



