Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/21427#discussion_r197508262
--- Diff: python/pyspark/worker.py ---
@@ -110,9 +116,20 @@ def wrapped(key_series, value_series):
"Number of columns of the returned pandas.DataFrame "
"doesn't match specified schema. "
"Expected: {} Actual: {}".format(len(return_type),
len(result.columns)))
- arrow_return_types = (to_arrow_type(field.dataType) for field in
return_type)
- return [(result[result.columns[i]], arrow_type)
- for i, arrow_type in enumerate(arrow_return_types)]
+
+ if not assign_cols_by_pos:
+ try:
+ # Assign result columns by schema name
+ return [(result[field.name], to_arrow_type(field.dataType))
+ for field in return_type]
+ except KeyError:
--- End diff --
This seems ok to me since it's basically the same, but I don't think we
need to worry about `to_arrow_type` throwing a `KeyError`; the `KeyError`
here can only come from the `result[field.name]` lookup. Is there a
particular reason you suggested handling the positional case like this?
```
[(result.iloc[:, i], to_arrow_type(field.dataType))
 for i, field in enumerate(return_type)]
```
To me it seems better to look up by column labels, as is currently done:
```
[(result[result.columns[i]], to_arrow_type(field.dataType))
 for i, field in enumerate(return_type)]
```
---