Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/21427
Here are some examples that currently work, but would no longer work under
the proposed fix. These are all cases where columns are named with strings, but
the names do not match the schema (let me know if I've missed any cases):
1) DataFrame constructed with a dict, where column order happens to match
field type order in schema
```python
@pandas_udf("a string, b float", GROUPED_MAP)
def foo(pdf):
    return pd.DataFrame({'x': ['hi'], 'y': [1.0]})
```
2) Data used positionally and columns specified as list of strings that
don't match schema
```python
@pandas_udf("a string, b float", GROUPED_MAP)
def foo(pdf):
    return pd.DataFrame([('hi', 1.0)], columns=['x', 'y'])
```
Both of these currently work, but I think (1) is very error prone because
the dict could have reordered the columns (as reported in this JIRA), so we
should not allow it under any circumstances. (2) is not problematic, but I'm
not sure why anyone would do this. Unfortunately, I don't think there is any
way to distinguish between the two. If we decide this should be controlled by
a config, I'm OK with that, but if matching is positional by default then a
lot of people will hit this problem and not be able to tell why.
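
To make the risk in (1) concrete, here is a minimal sketch in plain Python (no Spark or pandas needed). The helpers `match_by_position` and `match_by_name` are hypothetical names made up for this illustration; they are not Spark APIs and do not reflect how the PR implements anything. The sketch only shows that positional matching silently depends on whatever order the dict keys end up in, while name matching fails loudly when the names do not line up with the schema.

```python
# Hypothetical illustration only; these helpers are not part of Spark or pandas.
schema_fields = ["a", "b"]  # field names from the schema "a string, b float"


def match_by_position(columns, fields):
    """Pair each schema field with the returned column at the same index."""
    return dict(zip(fields, columns))


def match_by_name(columns, fields):
    """Pair schema fields with columns by name; raise if any field is missing."""
    missing = [f for f in fields if f not in columns]
    if missing:
        raise KeyError("columns missing from result: %s" % missing)
    return {f: f for f in fields}


# Same data as example (1), but the keys happen to be written in another order.
data = {"y": [1.0], "x": ["hi"]}

insertion_order = list(data)  # key order as written: ['y', 'x']
sorted_order = sorted(data)   # key order if the keys get sorted: ['x', 'y']

# Positional matching gives different field-to-column pairings depending on
# which ordering the DataFrame constructor ends up using.
by_insertion = match_by_position(insertion_order, schema_fields)
by_sorted = match_by_position(sorted_order, schema_fields)
print(by_insertion)  # {'a': 'y', 'b': 'x'}  (float data lands in the string field)
print(by_sorted)     # {'a': 'x', 'b': 'y'}

# Name matching rejects the result outright, since neither 'a' nor 'b' exists.
try:
    match_by_name(insertion_order, schema_fields)
    name_match_failed = False
except KeyError:
    name_match_failed = True
print(name_match_failed)  # True
```

Note that example (2) would also be rejected by `match_by_name` here, which is the "no way to distinguish between the two" problem: both cases look like string column names that do not match the schema.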