Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/21427#discussion_r197508262
--- Diff: python/pyspark/worker.py ---
@@ -110,9 +116,20 @@ def wrapped(key_series, value_series):
"Number of columns of the returned pandas.DataFrame "
"doesn't match specified schema. "
"Expected: {} Actual: {}".format(len(return_type),
len(result.columns)))
- arrow_return_types = (to_arrow_type(field.dataType) for field in
return_type)
- return [(result[result.columns[i]], arrow_type)
- for i, arrow_type in enumerate(arrow_return_types)]
+
+ if not assign_cols_by_pos:
+ try:
+ # Assign result columns by schema name
+ return [(result[field.name], to_arrow_type(field.dataType))
+ for field in return_type]
+ except KeyError:
--- End diff --
This seems ok to me since it's basically the same, but I don't think we
need to worry about `to_arrow_type` throwing a `KeyError`; the `KeyError`
here can only come from the `result[field.name]` lookup. Is there a
particular reason you suggested handling the positional case like this?
```
[(result.iloc[:, i], to_arrow_type(field.dataType))
 for i, field in enumerate(return_type)]
```
To me it seems better to look up by column labels, as is currently done:
```
[(result[result.columns[i]], to_arrow_type(field.dataType))
 for i, field in enumerate(return_type)]
```
---