Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/21427#discussion_r191511343
--- Diff: python/pyspark/worker.py ---
@@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                        "Number of columns of the returned pandas.DataFrame "
                        "doesn't match specified schema. "
                        "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
-            arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
-            return [(result[result.columns[i]], arrow_type)
-                    for i, arrow_type in enumerate(arrow_return_types)]
+            try:
+                # Assign result columns by schema name
+                return [(result[field.name], to_arrow_type(field.dataType))
+                        for field in return_type]
+            except KeyError:
+                if all(not isinstance(name, basestring) for name in result.columns):
+                    # Assign result columns by position if they are not named with strings
+                    return [(result[result.columns[i]], to_arrow_type(field.dataType))
+                            for i, field in enumerate(return_type)]
+                else:
+                    raise
--- End diff ---
@viirya I think it's just that it's very common for users to create a
DataFrame from a dict, using the field names as keys, without knowing that this
can change the order of the columns. So even if the field types all match (in
the case of this JIRA they were all StringTypes), the data can end up paired
with the wrong column names. From the user's perspective this is really
confusing and hard to debug.
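To illustrate the mix-up (a minimal sketch with hypothetical column names, not
code from the patch): under Python 2, dict key order was arbitrary, so a
DataFrame built from a dict could come back with its columns in a different
order than the schema expects. Positional pairing then attaches the wrong data
to each field, while name-based pairing is order-independent:

```python
import pandas as pd

# Hypothetical UDF result: the user built the DataFrame from a dict,
# so the column order may not match the schema order ('id', 'name').
result = pd.DataFrame({'name': ['a', 'b'], 'id': [1, 2]})
schema_names = ['id', 'name']

# Positional assignment pairs each schema field with whatever column
# happens to sit at that index...
positional = [result[result.columns[i]] for i in range(len(schema_names))]

# ...while assignment by name always picks the matching column.
by_name = [result[name] for name in schema_names]
```

Here `by_name[0]` is the `id` data regardless of how the dict ordered the
columns, which is exactly why the patch tries name-based assignment first.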
When defining the pandas_udf, the return type requires the field names, so
if the returned DataFrame has columns indexed by strings, I think it's fair to
assume that a mismatch is a mistake. If the user wants to use positional
columns, they can index by integers - and I'll add this to the docs.
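A sketch of that positional escape hatch (standalone illustration mirroring
the patch's fallback logic, with a hypothetical schema): if the user leaves the
columns on pandas' default integer RangeIndex, no label is a string, so
assignment falls back to position:

```python
import pandas as pd

# Columns are left as the default integer index (0, 1), signalling
# that the user intends positional assignment.
result = pd.DataFrame([[1, 'a'], [2, 'b']])
schema_names = ['id', 'name']  # hypothetical return type field names

# Mirroring the patch: only fall back to position when no column
# label is a string.
if all(not isinstance(name, str) for name in result.columns):
    pairs = [(schema_names[i], result[result.columns[i]])
             for i in range(len(schema_names))]
```

(The patch itself checks `basestring` because Spark still supported Python 2
at the time; `str` is the Python 3 equivalent used here.)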
That being said, I do suppose this slightly changes the behavior if, by some
chance, a user had gone out of their way to make a pandas_udf whose returned
columns have different names than the return type schema but the same field
type order. That seems pretty unlikely to me, though.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]