[ 
https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759468#comment-16759468
 ] 

Hyukjin Kwon commented on SPARK-26810:
--------------------------------------

Ah, gotcha. Yeah, that does look like the cause. Sorry, I rushed through reading it.

BTW, I think we should clearly define what is supported and what is not.
In my experience so far, and due to the nature of Python, there are many
holes. It would be nicer if we could whitelist what we support (that is, what
we have documented).
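
To make the whitelisting idea concrete, here is a minimal sketch in plain Python. It assumes the only documented field-name type is str, and the helper name validate_field_names is hypothetical; it is not part of PySpark and not a proposed patch.

```python
def validate_field_names(*names):
    """Hypothetical helper (not part of PySpark): accept only str field names."""
    # Collect every argument that falls outside the whitelist of supported types.
    bad = [n for n in names if not isinstance(n, str)]
    if bad:
        raise TypeError(
            "field names must be str, got: %s"
            % ", ".join(type(n).__name__ for n in bad)
        )
    return names


validate_field_names("a", "b")        # accepted
try:
    validate_field_names(["a", "b"])  # a list passed as a single field name
except TypeError as exc:
    print(exc)
```

Rejecting the input at construction time would surface a readable message early, at the cost of breaking callers that rely on the undocumented behaviour.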

> Fixing SPARK-25072 broke existing code and fails to show error message
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26810
>                 URL: https://issues.apache.org/jira/browse/SPARK-26810
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Arttu Voutilainen
>            Priority: Minor
>
> Hey,
> We upgraded Spark recently, and 
> https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail 
> after the upgrade. Annoyingly, the error message formatting also threw an 
> exception itself, thus hiding the message we should have seen.
> Repro using gettyimages/docker-spark, on 2.4.0:
> {code}
> from pyspark.sql import Row
> r = Row(['a','b'])
> r('1', '2')
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__
>     "but got %s" % (self, len(self), args))
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__
>     return "<Row(%s)>" % ", ".join(self)
> TypeError: sequence item 0: expected str instance, list found
> {code}
> On 2.3.1, and also showing how this was used:
> {code}
> from pyspark.sql import Row, types as T
> r = Row(['a','b'])
> df = spark.createDataFrame([Row(col='doesntmatter')])
> rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')])
> spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), 
> T.StructField('b', T.StringType())])).collect()
> {code}
> {code}
> [Row(a='a1', b='b2'), Row(a='a1', b='b2')]
> {code}
> While I do think the code we had was quite horrible, it used to work. The 
> unexpected error came from __repr__, which assumes that the arguments given to 
> the Row constructor are strings. That sounds like a reasonable assumption, so 
> maybe the Row constructor should validate that it holds? (I guess that might be 
> another potentially breaking change though, if someone has code as weird as 
> ours...)
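
The __repr__ failure in the quoted traceback can be reproduced without Spark, since it comes down to str.join rejecting non-str items. The sketch below uses a hypothetical tuple subclass, FieldNames, as a stand-in; it is not PySpark's Row, and the tolerant __repr__ is only one possible approach.

```python
class FieldNames(tuple):
    """Hypothetical stand-in for a tuple-based row of field names; not PySpark's Row."""

    def __repr__(self):
        # str() each item so a non-str entry cannot break the formatting;
        # str items are rendered exactly as a plain join would render them.
        return "<Row(%s)>" % ", ".join(str(item) for item in self)


fields = (["a", "b"],)  # one field "name" that is actually a list
try:
    ", ".join(fields)   # same call shape as the __repr__ in the traceback
except TypeError as exc:
    print(exc)

print(repr(FieldNames(fields)))
```

With a repr like this, the original "but got %s" error message could be formatted instead of being hidden behind a second exception.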


