Bryan Cutler created SPARK-29748:
------------------------------------

             Summary: Remove sorting of fields in PySpark SQL Row creation
                 Key: SPARK-29748
                 URL: https://issues.apache.org/jira/browse/SPARK-29748
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 3.0.0
            Reporter: Bryan Cutler


Currently, when a PySpark Row is created with keyword arguments, the fields are 
sorted alphabetically. This has confused many users because the sorting is not 
obvious (although it is stated in the pydocs), and an error can occur later 
when a schema is applied and the field order does not match.
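
As a rough illustration of the mismatch, here is a minimal pure-Python sketch. 
It does not use PySpark; {{sorted_row}} and {{schema_fields}} are hypothetical 
names that only mimic the sorting described above and a schema applied by 
position.

```python
def sorted_row(**kwargs):
    # Hypothetical stand-in for the current behavior: field names are
    # sorted alphabetically, regardless of the order they were written in.
    names = sorted(kwargs)
    return names, tuple(kwargs[n] for n in names)


# The caller writes name first, then age ...
names, values = sorted_row(name="Alice", age=11)

# ... and later applies a schema in the order the fields were written.
schema_fields = ["name", "age"]

# Pairing by position silently attaches values to the wrong fields.
paired = dict(zip(schema_fields, values))
print(names)   # field order after sorting
print(paired)  # values matched to schema fields by position
```

In this sketch the value 11 ends up under "name" and "Alice" under "age", 
which is the kind of mismatch that only surfaces once the schema is applied.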

The original reason for sorting fields is that kwargs in Python < 3.6 are 
not guaranteed to be in the same order that they were entered. Sorting 
alphabetically ensures a consistent order. Matters are further 
complicated by the flag {{__from_dict__}}, which allows the {{Row}} fields to 
be referenced by name when made by kwargs, but this flag is not serialized 
with the Row, which leads to inconsistent behavior.

This JIRA proposes that any sorting of the fields be removed. Users with Python 
3.6+ creating Rows with kwargs can continue to do so since Python will ensure 
the order is the same as entered. Users with Python < 3.6 will have to create 
Rows with an OrderedDict or by using the Row class as a factory (explained in 
the pydoc). If kwargs are used on Python < 3.6, either an error will be raised 
or Row creation can fall back to a LegacyRow that sorts the fields as before. 
This LegacyRow will be deprecated immediately and removed once support for 
Python < 3.6 is dropped.
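
The proposed behavior could be sketched roughly as follows. This is plain 
Python, not the PySpark implementation: {{row_fields}} and 
{{legacy_row_fields}} are hypothetical helpers that return (names, values) 
tuples, and the choice between raising an error and falling back is left open 
above, so both paths are shown as assumptions.

```python
import sys
import warnings


def row_fields(**kwargs):
    """Hypothetical helper: keep fields in the order the kwargs were given."""
    if sys.version_info < (3, 6):
        # Assumed error path: kwargs order is not guaranteed before 3.6.
        raise RuntimeError(
            "kwargs order is not guaranteed on Python < 3.6; "
            "use an OrderedDict or a Row factory instead"
        )
    names = tuple(kwargs)  # insertion order is preserved on Python 3.6+
    return names, tuple(kwargs[n] for n in names)


def legacy_row_fields(**kwargs):
    """Hypothetical fallback: sort fields as before, with a deprecation warning."""
    warnings.warn(
        "sorted field order is deprecated", DeprecationWarning, stacklevel=2
    )
    names = tuple(sorted(kwargs))
    return names, tuple(kwargs[n] for n in names)


# Fields stay in the order they were written.
ordered = row_fields(name="Alice", age=11)

# The legacy path still sorts alphabetically, but warns about it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = legacy_row_fields(name="Alice", age=11)
```

Keeping the sorted path in a separate, explicitly deprecated helper mirrors the 
idea of a LegacyRow: existing code can opt in to the old ordering while new 
code gets the order it wrote.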


