[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

BryanCutler Wed, 30 May 2018 16:29:38 -0700

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Here are some examples that currently work, but would no longer work under 
the proposed fix. These are all cases where columns are named with strings, but 
the names do not match the schema (let me know if I've missed any cases):
    
    1) DataFrame constructed with a dict, where column order happens to match 
field type order in schema
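    ```python
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    GROUPED_MAP = PandasUDFType.GROUPED_MAP  # alias used in these examples
    @pandas_udf("a string, b float", GROUPED_MAP)
    def foo(pdf):
        return pd.DataFrame({'x': ['hi'], 'y': [1.0]})
    ```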
    
    2) Data used positionally and columns specified as list of strings that 
don't match schema
    ```python
    @pandas_udf("a string, b float", GROUPED_MAP)
    def foo(pdf):
        return pd.DataFrame([('hi', 1.0)], columns=['x', 'y'])
    ```
    
    Both of these currently work, but I think (1) is very error prone because 
the dict could have reordered the columns (as reported in this JIRA), so we 
should not allow this under any circumstance. (2) is not problematic, but I'm 
not sure why anyone would do this. Unfortunately, I don't think there is any 
way to distinguish between the two, since both return a DataFrame whose string 
column labels do not match the schema. If we decide this should be controlled 
by a config, I'm OK with that, but if assignment is positional by default then 
a lot of people will hit this problem and not be able to tell why.
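
To make the trade-off concrete, here is a minimal sketch of the two policies being discussed. It uses plain lists of column labels instead of real DataFrames so that it runs without Spark or pandas. The helper `resolve_columns` and its `positional_fallback` flag are hypothetical names chosen for illustration; they are not Spark's implementation or an existing config.

```python
def resolve_columns(result_columns, schema_names, positional_fallback=True):
    """Return the result's column labels in the order they feed the schema fields.

    Hypothetical helper for illustration only, not Spark's actual logic.
    """
    if len(result_columns) != len(schema_names):
        raise ValueError("column count does not match schema")

    all_strings = all(isinstance(c, str) for c in result_columns)

    if all_strings and set(result_columns) == set(schema_names):
        # Labels match the schema: assign by name, so result order is irrelevant.
        return list(schema_names)

    if all_strings and not positional_fallback:
        # Strict policy: string labels that do not match the schema are rejected.
        raise KeyError("columns %r do not match schema %r"
                       % (list(result_columns), list(schema_names)))

    # Positional policy: the i-th result column feeds the i-th schema field.
    return list(result_columns)


schema = ["a", "b"]

# Cases (1) and (2) both yield the labels ['x', 'y'], so nothing in the
# returned frame reveals whether it was built from a dict or from a list.
labels_case_1 = ["x", "y"]
labels_case_2 = ["x", "y"]
print(labels_case_1 == labels_case_2)

# If a dict had produced the columns in a different order, the positional
# policy would feed 'y' (the float data) into field 'a' (a string) silently.
print(resolve_columns(["y", "x"], schema))

# The strict policy turns that silent mismatch into an explicit error.
try:
    resolve_columns(["y", "x"], schema, positional_fallback=False)
except KeyError as exc:
    print("rejected:", exc)
```

Under the strict policy both (1) and (2) stop working, which is the behavior change described above. Under the positional policy both keep working, including the reordered-dict case that silently assigns data to the wrong field.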



