+1

On Thu, Nov 7, 2019 at 6:08 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> +1
>
> 2019년 11월 6일 (수) 오후 11:38, Wenchen Fan <cloud0...@gmail.com>님이 작성:
>>
>> Sounds reasonable to me. We should make the behavior consistent within Spark.
>>
>> On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler <cutl...@gmail.com> wrote:
>>>
>>> Currently, when a PySpark Row is created with keyword arguments, the fields 
>>> are sorted alphabetically. This has caused a lot of confusion for users 
>>> because it is not obvious (although it is stated in the pydocs) that the 
>>> fields will be sorted alphabetically. Later, when a schema is applied and 
>>> the field order does not match, an error occurs. Here are some of the JIRAs 
>>> I have been tracking, all related to this issue: SPARK-24915, SPARK-22232, 
>>> SPARK-27939, SPARK-27712, and relevant discussion of the issue [1].
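>>>
>>> For anyone who has not run into this, the sorting itself looks roughly like 
>>> this in a pre-3.0 shell (the field names and values here are just made-up 
>>> samples):
>>>
>>> >>> from pyspark.sql import Row
>>> >>> Row(name="Alice", age=11)   # fields come back sorted alphabetically
>>> Row(age=11, name='Alice')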
>>>
>>> The original reason for sorting the fields is that kwargs in Python < 3.6 
>>> are not guaranteed to preserve the order in which they were entered [2]. 
>>> Sorting alphabetically ensures a consistent order. Matters are further 
>>> complicated by the __from_dict__ flag, which allows the Row fields to be 
>>> referenced by name when the Row is made from kwargs, but this flag is not 
>>> serialized with the Row and leads to inconsistent behavior. For instance:
>>>
>>> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
>>> Row(B='2', A='1')
>>> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]),
>>> ...                       "B string, A string").first()
>>> Row(B='1', A='2')
>>>
>>> I think the best way to fix this is to remove the sorting of fields when 
>>> constructing a Row. For users on Python 3.6+, nothing would change, because 
>>> those versions of Python guarantee that kwargs stay in the order entered. 
>>> For users on Python < 3.6, using kwargs would check a conf to either raise 
>>> an error or fall back to a LegacyRow that sorts the fields as before. With 
>>> Python < 3.6 being deprecated now, this LegacyRow could also be removed at 
>>> the same time. Other ways to create Rows would not be affected. I have 
>>> opened a JIRA [3] to capture this, but I am wondering what others think 
>>> about fixing this for Spark 3.0?
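>>>
>>> As a rough sketch of the direction (not the actual patch; the env-var 
>>> switch, its name, and the legacy handling below are all placeholders for 
>>> whatever the PR settles on):
>>>
>>>   import os
>>>   import sys
>>>
>>>   def _legacy_row_field_sorting():
>>>       # Placeholder for the proposed conf; the real name and mechanism
>>>       # would be decided in the PR.
>>>       return os.environ.get("PYSPARK_ROW_FIELD_SORTING_ENABLED",
>>>                             "false").lower() == "true"
>>>
>>>   class Row(tuple):
>>>       # Simplified; the real pyspark.sql.Row does much more.
>>>       def __new__(cls, *args, **kwargs):
>>>           if args and kwargs:
>>>               raise ValueError("Cannot use both args and kwargs to create Row")
>>>           if kwargs:
>>>               names = list(kwargs)  # entry order is guaranteed on Python 3.6+
>>>               if sys.version_info[:2] < (3, 6):
>>>                   if _legacy_row_field_sorting():
>>>                       names = sorted(names)  # legacy behavior: alphabetical
>>>                   else:
>>>                       raise Exception(
>>>                           "kwargs order is undefined before Python 3.6; "
>>>                           "enable legacy field sorting to keep the old behavior")
>>>               row = tuple.__new__(cls, [kwargs[n] for n in names])
>>>               row.__fields__ = names
>>>               return row
>>>           return tuple.__new__(cls, args)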
>>>
>>> [1] https://github.com/apache/spark/pull/20280
>>> [2] https://www.python.org/dev/peps/pep-0468/
>>> [3] https://issues.apache.org/jira/browse/SPARK-29748



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
